Abstract


Relational learning deals with the setting where one has multiple sources of data,
each describing different properties of the same set of entities. We are concerned
primarily with settings where the properties are pairwise relations between entities,
and attributes of entities. We want to predict the value of relations and attributes,
but relations between entities violate the basic statistical assumption of exchangeable
data points, or entities. Furthermore, we desire models that scale gracefully as the
number of entities and relations increases.

Matrices are the simplest form of relational data, and we begin by distilling the
literature on low-rank matrix factorization into a small number of modelling choices.
We then frame a large class of relational learning problems as simultaneously
factoring sets of related matrices: i.e., Collective Matrix Factorization. Each entity is
described by a small number of parameters, and if an entity is described by more than
one matrix, those parameters participate in multiple matrix factorizations. Maximum
likelihood estimation of the resulting model involves a large non-convex optimization,
which we reduce to cyclically solving convex optimizations over small subsets of the
parameters. Each convex subproblem can be solved by Newton-Raphson, which we
extend to the setting of stochastic Newton-Raphson.

To address the limitations of maximum likelihood estimation in matrix factorization
models, we extend our approach to the hierarchical Bayesian setting. Here,
Bayesian estimation involves computing a high-dimensional integral with no analytic
form. If we resorted to standard Metropolis-Hastings techniques, slow mixing would
limit the scalability of our approach to large sets of entities. We show how to
accelerate Metropolis-Hastings by using our efficient solution for maximum likelihood
estimation to guide the sampling process.

This thesis rests on two claims: (i) that Collective Matrix Factorization can
effectively integrate different sources of data to improve prediction; and (ii) that
training scales well as the number of entities and observations increases. We consider
two real-world data sets in experimental support of these claims: augmented
collaborative filtering and augmented brain imaging. In augmented collaborative filtering, we
show that genre information about movies can be used to increase the predictive
accuracy of users' ratings. In augmented brain imaging, we show that word co-occurrence
information can be used to increase the predictive accuracy of a model of changes in
brain activity to word stimuli, even in regions of the brain that were never included in
the training data.

Chapter 1
Introduction


Prediction is the pith and marrow of machine learning, the task which defines it. Stripped
of all artifice, the goal of prediction is to guess the properties of an object given
information about like objects.1 The enormous variety of prediction tasks stems from different
definitions of “property” and “like”.

The most common form of prediction represents objects by a predetermined set of
features, or attributes. Attributes are the properties of an object whose value can be
ascertained. Ascertaining the value of attributes may be difficult, costly, or time-consuming,
and so we wish to automate the prediction of attributes. Each object is reduced to an
assignment of values to attributes, a record. A set of records is known as a data set. The
paradigmatic assumption in machine learning is that these records are exchangeable draws
from a fixed, unknown probability distribution over the attributes. This probability
distribution provides a way of reasoning about the behaviour of new objects. Because the data
forms a table whose rows are records, and whose columns are attributes, we call such data
tabular or attribute-value. Attribute-value data suffers from two important representational
limitations, which motivate this thesis:

1. Entities must be of the same type. If objects are represented by a fixed set of
attributes, then all the objects must possess those attributes. For example, if all the
objects are human beings, then attributes like age, gender, and height describe each
person. However, if we add a teapot to the set of persons, one would be hard-pressed
to determine its gender.

1 One may ask, “What is an object?”. Since this author has no desire to traipse down that ontological
rabbit hole, let us simply assume that objects are things that have properties. Evoking the literature on
relational databases, we usually refer to objects as “entities”.

2. There are no relations. Attributes are properties of objects; there is no notion of
properties of sets of objects, or relations. For example, if the objects are people,
we would be hard-pressed to encode the notion of friendship as an attribute: first,
because friendship is a property of two entities; second, because each entity may
have different numbers of friends.

It is not hard to see that there are many kinds of data that violate attribute-value limitations:
e.g., graphs where the nodes and edges correspond to entities and relations, respectively.
Another example is a relational database, where the entities correspond to records, grouped
by type into tables. A non-trivial relational database must have relations, which cannot be
easily fit into an attribute-value representation. In Artificial Intelligence, predicate logic
has long been used to encode relational data in knowledge bases: entities correspond to
constants, relations and attributes are predicates, and logical sentences express the
connection between relations. For background on the use of logic for knowledge bases, we refer
the reader to Levesque and Lakemeyer [66].2

The primary difference between attribute-value and relational data is the existence of
relations, or links, between entities. Because of the relations between entities, standard
statistical assumptions, such as independence of entities, are violated. Moreover, the
correlations due to relations should not be ignored, as they provide a source of information that
can significantly improve the accuracy of common machine learning tasks over models that
exploit only the attributes. Furthermore, in many scenarios, relations are the properties we
want to predict.


1.1  Information  Integration 

Relational learning is a rich representation that has found use in many applications.
However, in this thesis, we focus our attention on problems involving information integration.
The phrase “information integration” is often used to refer to the problems involved in
merging different data sources: e.g., duplicate record elimination, coreference resolution,
record linkage. However, in this dissertation, information integration refers to the likely
motivation for the aforementioned pre-processing tasks: incorporating different sources of
information about entities to improve a predictive model.

2 From the perspective of first-order logic, the things we call relations are actually functions, i.e., a
mapping from sets of entities to a value, which need not be a truth value in logic. However, the phrase “relational
learning” has come to refer broadly to techniques that exploit links between entities, whether or not those
links are encoded as relations in first-order logic.

While the phrase “information integration” is often used to refer to the problems
involved in merging different data sources (e.g., duplicate record elimination, coreference
resolution, record linkage), we are concerned with integrating information from different
data sources after these pre-processing steps have been performed.

An  entity  often  has  many  different  properties  (attributes  and  relations).  One  source 
of  data  may  describe  only  a  small  subset  of  all  the  properties  of  a  particular  entity-type. 
Information  integration  problems  address  the  scenario  where  one  has  multiple  sources  of 
data,  each  containing  a  different  subset  of  properties  for  the  same  entities.  The  data  sets 
may  be  collected  independently,  but  the  entities  they  describe  overlap.  If  the  properties 
in  different  data  sources  are  correlated,  information  from  one  data  source  may  be  used  to 
improve  predictions  of  properties  in  other  data  sources. 

Information integration problems are ubiquitous, and in this section we present
examples which suggest the broad range of scenarios where the work of this thesis may be
useful (the ones we consider further in this thesis are marked in boldface):

> Web user modeling: There are many different kinds of information one can collect about
a user: search queries, bookmarks, online purchases, and measures of social interaction
among users, such as e-mail and instant messages. The belief is that knowledge of a
user’s behaviour in one data source is predictive of their behaviour in other data sources
(e.g., friends are more likely to have similar interests than two randomly selected
persons, and this similarity of interests is reflected in their search queries).

> Gene function prediction: There are different sources of information one can collect
about a gene: manually annotated hierarchies, such as the Gene Ontology [125],
location on the genome, sequence similarity, and interaction between expressed proteins.
The data sources are often collected independently of each other, but one may wish to
augment a model for prediction in one data source (e.g., Gene Ontology) using another
correlated source of data (e.g., interaction between expressed proteins).

> Educational psychometrics: Standardized educational testing yields large quantities of
data regarding student performance on questions in different areas (e.g., English,
History, Algebra, Calculus). The same students are tested in the same or different subjects.
If one believes that there is transfer of skills, within or across domains (e.g., ability in
algebra is predictive of ability in calculus), then integrating student performance data
from different subjects should improve a predictive model of student performance.

> fMRI modeling: Functional Magnetic Resonance Imaging (fMRI) is often used to
measure responses in small regions of the brain (i.e., voxels) given external stimuli. Given
enough experiments on a sufficiently broad range of stimuli, one can build models that
predict patterns of brain activation given new stimuli [83]. Running enough experiments
is costly, but we can often collect cheap side information about the stimuli. We consider
an experiment where the stimulus is a word-picture pair displayed on a screen. We
can collect statistics of whether the stimulus word co-occurs with other commonly used
words in large, freely available text corpora. By integrating the two sources of
information, related through the shared set of stimulus words, we can significantly improve a
model of brain activation.

> Collaborative filtering: Recommendation systems involve using users’ measured
preferences for certain items to select other items the users might be interested in. A notable
example of this is the Netflix Prize challenge, which involves predicting the rating a
user would assign to a movie given millions of user ratings as training data. The
recommendation problem involves a very simple relation: the rating a user assigns to a movie.
However, we often have other sources of side information: properties of the users like
their age, gender, and friendship between users; and properties of movies like the genre
a movie belongs to and the actors in the film. These forms of side information are themselves
additional relations; can we use them to improve the quality of recommendations?

As  we  seek  to  model  increasingly  complicated  objects,  the  representational  limitations 
of  attribute-value  learning  become  obvious.  Objects  can  have  many  different  properties 
(attributes  and  relations);  we  cannot  measure  them  all  at  once,  nor  can  we  easily  establish, 
a  priori,  which  properties  are  most  relevant  to  the  task  at  hand.  In  contrast,  a  relational 
representation  allows  for  a  unified  representation  of  heterogeneous  data  sources. 

We focus on two information integration tasks in this thesis: augmented collaborative
filtering and augmented brain imaging. In the first task, we have data about interactions
between users and items (ratings, purchases), which is augmented with side information
about the items. In the second task, we have data about how activity in small regions of
the brain changes in response to stimuli, which is augmented with side information about
the stimuli.


1.2  Thesis  Statement 

Our goal is the development of statistical techniques that can predict the value of
unobserved relations, by exploiting correlations between observed relations. For any such
technique, we have as our desiderata the following:

1. A flexible representation language: Relational data is a richer representation than
its attribute-value counterpart, which can complicate modeling. Typically, there are
multiple relations, and each can take on its own value type. For example, in the
augmented collaborative filtering example, ratings are on an ordinal scale, whereas
the other relations are binary. Our modeling approach should be able to take into
account relations with different value-types.

2. Does not require structural knowledge: Often, we are presented with data where
we believe the relations are correlated, but the structure of these correlations is
unknown. Our techniques should not rely upon the existence of prior knowledge about
how information propagates across relations. In the augmented brain imaging
example, one would be hard-pressed to elicit rules relating word counts to activity in
regions of the brain. We believe that the two data sources are correlated, but
encoding structural information about how they correlate is difficult.

3. Generalization to new entities: Our training data will contain a fixed set of entities,
but any model we learn should be able to predict relations involving entities that did
not appear in the training data, but where some relations involving the new entity are
observed. In the augmented collaborative filtering example, we want our model to
generalize well to new users, not just the ones who were represented in the training
data.

4. Relations are sparsely observed: One rarely observes a relation for all combinations
of its arguments. While the data may contain a large number of entities, the typical
number of observed relations an entity participates in may be small. In the
augmented collaborative filtering example, only a small fraction of items are rated by
any single user.

5. Models reflect internal uncertainty: Since statistical models are learned from finite
training data, there is always some uncertainty in predicted values. We seek
models that can quantify their own internal uncertainty about the world. This is most
commonly achieved through Bayesian techniques.

6. Simple interface: While the model may involve complicated routines for learning
and prediction, the end-user should not have to concern themselves with the internal
details of a relational model.

1.3  Statistical  Design  Patterns 

There are certain problems in machine learning which recur in a variety of different
domains. For example, a vision researcher might want to infer a hierarchical categorization
of images using SIFT features; while at the same time, a bioinformatics researcher is
trying to infer a gene ontology from sequence features. The objects and their properties are
very different, but the underlying problem is the same: inducing a hierarchical
categorization of objects from recorded properties. Each researcher may independently create
very similar probabilistic models for inducing hierarchies, or they might realize the
similarity from a literature review, and attempt to adapt related work to their own application.
The limitation of the cross-pollination approach is that significant effort may be required
to separate out the domain-specific aspects in related work. The consequence of tying
models to specific applications is that a lot of effort is duplicated unnecessarily.

There are well-developed subareas within machine learning, where commonly
occurring modeling tasks have been abstracted away from their specific domains, allowing them
to be easily reused and adapted in other settings. One of the best examples in machine
learning is time-series modeling, where Hidden Markov Models are a popular approach.
The first applications of Hidden Markov Models (HMMs) were in speech recognition in the
1970s [100]. Over time, the salient structure of the problem was abstracted away from its
original application; i.e., the salient structure is discrete-state filtering in sequential data
under a Markov assumption. Today, Hidden Markov Models and their variants are among
the most popular techniques in bioinformatics: different domains; common structure.
Modeling sequential data is a common problem in many fields, and we recognize HMMs
as a basic approach for such problems.

Graphical models can act as an interface between the low-level details of training and
inference and high-level domain-specific tasks. Using graphical models as an interface
works reasonably well when there are a small number of random variables (e.g., one
random variable per attribute). When dealing with complex, structured data involving many
different types of entities, each with their own set of attributes and relations, the
graphical-models-as-interface approach is less useful. High-level languages for defining graphical
models have been proposed: Markov Logic Networks [32]3, Bayesian Logic [79],
Probabilistic Inductive Logic Programming [101], Probabilistic Relational Models [39],
Relational Markov Networks [124], Church [43], IBAL [96], FACTORIE [77], Infer.NET [80],
DAPER [49]. These high-level languages provide macros for generating graphical
models with repeated structure among the variables. The macros are usually syntactically
derived from subsets of typed first-order logic, although FACTORIE and Infer.NET are
based on object-oriented languages, and Church and IBAL are based on LISP and ML,
respectively. From the perspective of relational learning, first-order logics are convenient:
they have compact descriptors for relations, and declarative languages can hide many of
the low-level details of constructing the underlying graphical model. Making an analogy
to software engineering, graphical models are akin to the intermediate representation a
compiler uses when converting C or Java code into assembly; these high-level languages
are akin to programming languages like C++, LISP, or Prolog. Converting a description in
the high-level language into a graphical model is akin to compilation.

3 Domingos and Lowd [32] make a related argument about the need for high-level languages for
graphical models.

While these languages are enormously powerful tools, they provide little design
guidance to the modeler. What we want is the ability to convey expertise in the design of
graphical models for solving particular kinds of problems. In software engineering, such
expertise is often encoded in a design pattern [37], a reusable solution to a recurring
problem in software design. We propose a similar concept for graphical models, a statistical
design pattern, a reusable solution to a recurring modeling problem. Information
integration is a statistical design pattern that appears under many guises in the literature, e.g.,
in applications of semi-supervised learning, transfer learning, transduction. The approach
we propose, Collective Matrix Factorization, is an example of a statistical design pattern
that addresses the information integration problem.

There  are  a  few  examples  of  statistical  design  patterns  that  have  evolved  naturally — 
e.g.,  Hidden  Markov  Models  and  their  variants  form  a  statistical  design  pattern.  Low-rank 
matrix  factorization  is  another  example  of  a  statistical  design  pattern.  Neither  example 
maps  to  a  specific  model,  but  rather  a  class  of  models  that  address  a  basic  problem.  There 
are  many  variants  of  matrix  factorization,  but  they  all  address  the  same  basic  problem: 
predicting  the  value  of  an  arity-two  relation,  where  the  rows  and  columns  index  each 
of  the  arguments,  and  the  entries  correspond  to  values  of  the  relation.  In  Chapter  3  we 
provide  a  unified  view  of  low-rank  matrix  factorization  algorithms  (e.g.,  singular  value 
decomposition  [42],  non-negative  matrix  factorization  [63]). 
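
To make the matrix view of an arity-two relation concrete, the following is a minimal sketch and not one of the algorithms unified in Chapter 3. It assumes a fully observed matrix that is well approximated by a single factor, and recovers that factor by power iteration, a standard way to obtain the leading singular vectors. The names `rank_one_factor` and `predict` are hypothetical, introduced only for illustration.

```python
import math

def rank_one_factor(X, iters=100):
    """Approximate X by s * u v^T using power iteration on X^T X.

    X is a list of rows (a fully observed matrix); u and v are unit vectors
    and s is the leading singular value.
    """
    n_rows, n_cols = len(X), len(X[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # arbitrary unit start vector
    for _ in range(iters):
        # u = X v, then v = X^T u, renormalising v after every round
        u = [sum(X[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        v = [sum(X[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(X[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    s = math.sqrt(sum(x * x for x in u))
    u = [x / s for x in u]
    return u, s, v

def predict(u, s, v, i, j):
    """Predicted value of the relation for row entity i and column entity j."""
    return s * u[i] * v[j]
```

Each entry of the relation is predicted from one parameter per row entity and one per column entity. Higher ranks, partially observed matrices, and other losses need the more general treatment developed in the later chapters.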

An advantage of statistical design patterns is that the graphical models that implement
the pattern may have special structure which makes training and inference easier. Such
structure is extensively exploited in training and prediction with Hidden Markov
Models. Similarly, we exploit structure in the graphical models produced by collective matrix
factorization to reduce the computational cost of training and prediction, even though the
graphical model will grow with the number of entities.

A  statistical  design  pattern  may  be  encoded  in  a  variety  of  high-level  languages,  but  in 
this  thesis  we  focus  on  collective  matrix  factorization  as  a  graphical  plate  model,  a  kind 
of  high-level  language  itself.  Plate  models  are  also  easily  expanded  into  graphical  models. 
Many  of  our  contributions  focus  on  the  details  of  low-level  optimization,  which  are  easier 
to  describe  as  graphical  models.  The  modeler  need  never  be  exposed  to  the  details  of  the 
graphical  model:  from  their  view,  collective  matrix  factorization  is  presented  purely  in 
terms  of  matrices.  Figure  1.1  diagrams  our  world-view. 

Figure 1.1: The process of developing a probabilistic model. The modeler/user begins
with a domain-specific task. Once the domain is mapped onto a graphical model, either
directly or using a high-level language, the details of inference and estimation can be
largely hidden.

1.4  Main  Contributions 

This dissertation can be broken down into three parts. In the first part, we propose a unified
view of matrix factorization, viewing matrices as the simplest, most common form of
relational data. In the second part, we propose representing a large class of relational data
sets as collections of related matrices. Each matrix represents an arity-two relation, where
the rows and columns index entities. Two matrices are related if they share dimensions;
that is, they describe the same entities participating in different relations. Sets of related
matrices provide a simple interface for the end modeler. Beneath this interface, we
propose a statistical model based on tying parameters in the low-rank factorization of each
matrix. We call this approach Collective Matrix Factorization. This leads to a maximum
likelihood parameter estimation problem where the parameter space grows with the
number of entities, and where the underlying optimization is non-convex. In the third part, we
extend Collective Matrix Factorization to a fully Bayesian setting, which addresses many
of the limitations of the maximum likelihood approach. Even though we are modeling
relational data using relatively simple matrix factorization approaches, we show that (i)
there is a surprising amount of flexibility in the types of problems one can represent using
sets of related matrices; (ii) the structure within matrix factorizations allows us to develop
very efficient algorithms for learning and prediction.
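
The idea of tying parameters across related matrices can be sketched as follows. This is an illustrative simplification, not the estimation algorithm developed later in the thesis: it assumes squared loss with an L2 penalty `lam`, two matrices (user-by-movie ratings `X` and movie-by-genre indicators `Y`) stored as dictionaries of observed entries, and hypothetical function names. Under these assumptions each row update is a small ridge regression, which for squared loss coincides with a single exact Newton step. The movie factors `V` are the tied parameters, so their update draws on both matrices.

```python
import random

def solve(A, b):
    """Solve the small dense system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ridge_row(pairs, rank, lam):
    """Minimise sum_(q,t) (q.p - t)^2 + lam * |p|^2 over the row parameters p."""
    A = [[lam if a == b else 0.0 for b in range(rank)] for a in range(rank)]
    rhs = [0.0] * rank
    for q, target in pairs:
        for a in range(rank):
            rhs[a] += target * q[a]
            for b in range(rank):
                A[a][b] += q[a] * q[b]
    return solve(A, rhs)

def squared_loss(X, Y, U, V, Z):
    """Unregularised squared error over the observed entries of both matrices."""
    dot = lambda p, q: sum(a * b for a, b in zip(p, q))
    return (sum((dot(U[i], V[j]) - x) ** 2 for (i, j), x in X.items())
            + sum((dot(V[j], Z[g]) - y) ** 2 for (j, g), y in Y.items()))

def collective_factorize(X, Y, n_users, n_movies, n_genres,
                         rank=2, lam=0.1, sweeps=20, seed=0):
    """Fit X ~ U V^T and Y ~ V Z^T, cycling over rows of U, V and Z in turn."""
    rng = random.Random(seed)
    init = lambda n: [[rng.gauss(0.0, 0.1) for _ in range(rank)] for _ in range(n)]
    U, V, Z = init(n_users), init(n_movies), init(n_genres)
    for _ in range(sweeps):
        for i in range(n_users):
            U[i] = ridge_row([(V[j], x) for (a, j), x in X.items() if a == i], rank, lam)
        for j in range(n_movies):
            # Shared factor: a movie's parameters must explain both its ratings and its genres.
            pairs = [(U[i], x) for (i, b), x in X.items() if b == j]
            pairs += [(Z[g], y) for (b, g), y in Y.items() if b == j]
            V[j] = ridge_row(pairs, rank, lam)
        for g in range(n_genres):
            Z[g] = ridge_row([(V[j], y) for (j, b), y in Y.items() if b == g], rank, lam)
    return U, V, Z
```

Because every row update minimises a convex subproblem exactly, the regularised objective cannot increase from one update to the next, even though the joint problem over `U`, `V`, and `Z` is non-convex.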

This thesis stands on the following contributions, many of which center on novel
algorithms for the large-scale parameter estimation problems that arise from factoring sets of
related matrices:

> A unified view of matrix factorization that reduces the panoply of models in the
literature into six modeling choices, independent of the choice of learning algorithm. This
approach subsumes dimensionality reduction techniques, clustering algorithms, and
matrix co-clustering algorithms into a single framework. We believe that these choices
capture the important differences between matrix factorization models.

> An efficient maximum likelihood estimation algorithm for Collective Matrix
Factorization. Our approach exploits structure in matrix factorization models to reduce parameter
estimation to cyclically updating parameters in a set of tied linear models. This
reduction allows us to exploit not only the gradient of the objective, but also partial information
about the Hessian to dramatically speed up learning. Our alternating Newton-projection
approach can be applied to any matrix (or collective matrix) factorization where the
objective is decomposable and twice-differentiable.

> When the data matrices are densely observed (i.e., where entities are observed to
participate in many relationships) we propose the use of stochastic Newton optimization to
reduce the cost of maximum likelihood estimation. While stochastic optimization has
long been used for linear models, we are the first to generalize this approach to matrix
factorization.

> We extend Collective Matrix Factorization to the hierarchical Bayesian case, where a
posterior distribution over the model parameters is computed. This Bayesian approach
allows us to accurately generalize to new entities, and to account for the effect of
parameter uncertainty on predictions. We present experimental evidence which illustrates
the merits of the hierarchical Bayesian approach over maximum likelihood in Collective
Matrix Factorization.

> Usually Bayesian techniques reduce the expressiveness of the model to make
parameter estimation easier (e.g., conjugacy assumptions). We propose a Metropolis-Hastings
sampler, which imposes no such restrictions. Every maximum likelihood Collective
Matrix Factorization has a Bayesian analogue, which we can compute using, for example,
Metropolis-Hastings. However, standard random-walk Metropolis-Hastings mixes very
slowly, and so this simple version of the Bayesian approach is only practical on very
small matrices. Instead, we propose an adaptive, block Metropolis-Hastings sampler
where the proposal distribution is dynamically computed using the gradient and per-row
Hessian. Our alternating Newton-projection technique, used for maximum likelihood
estimation, provides the local gradient and per-row Hessian. Essentially, we are using
an efficient algorithm for maximum likelihood inference to guide Metropolis-Hastings,
reducing the cost of learning. Our adaptive approach is practical on much larger data
sets, such as those found in the augmented brain imaging problem.

> Using the augmented collaborative filtering and augmented brain imaging problems, we
show that integrating multiple relations leads to superior prediction. This illustrates the
value of Collective Matrix Factorization on information integration problems.
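
The idea of letting gradient and curvature information shape a Metropolis-Hastings proposal can be illustrated with a toy sketch. It is not the adaptive block sampler of this thesis: it assumes a single scalar parameter, a target whose negative log-density `neg_log_p` has strictly positive second derivative everywhere, and user-supplied `grad` and `hess` functions for that negative log-density. All names are hypothetical. The proposal is a Gaussian centred one Newton step from the current state, with variance equal to the inverse curvature, and the usual Hastings correction keeps the chain valid even though the proposal is not symmetric.

```python
import math
import random

def log_proposal(x_to, x_from, grad, hess):
    """Log density, up to an additive constant, of proposing x_to from x_from."""
    h = hess(x_from)                  # local curvature of -log p, assumed > 0
    mean = x_from - grad(x_from) / h  # one Newton step from x_from
    return 0.5 * math.log(h) - 0.5 * h * (x_to - mean) ** 2

def newton_guided_mh(neg_log_p, grad, hess, x0, n_samples, seed=0):
    """Metropolis-Hastings with a Newton-centred Gaussian proposal.

    Returns the list of samples and the fraction of accepted proposals.
    """
    rng = random.Random(seed)
    x, samples, accepted = x0, [], 0
    for _ in range(n_samples):
        h = hess(x)
        proposal = rng.gauss(x - grad(x) / h, 1.0 / math.sqrt(h))
        # log of p(x') q(x | x') / (p(x) q(x' | x))
        log_ratio = (neg_log_p(x) - neg_log_p(proposal)
                     + log_proposal(x, proposal, grad, hess)
                     - log_proposal(proposal, x, grad, hess))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x, accepted = proposal, accepted + 1
        samples.append(x)
    return samples, accepted / n_samples
```

When the target is exactly Gaussian, the Newton-centred proposal coincides with the target itself, so essentially every proposal is accepted. For other log-concave targets the proposal is only a local approximation, and the accept/reject step corrects for the mismatch.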


1.5  Organization  of  the  Thesis 

The  remainder  of  the  thesis  is  organized  as  follows: 

Chapter  2:  We  cover  background  material  and  notation  which  will  be  used  throughout 
this  thesis.  A  significant  portion  of  the  chapter  deals  with  probabilistic  graphical  models. 

Chapter 3: We present our unified view of matrix factorization, which captures the
important modeling decisions common to all matrix factorization models.

Chapter 4: Building on a particular matrix factorization algorithm, Exponential Family
PCA [26], we introduce collective matrix factorization as a technique for modeling
relational data. By exploiting structure within collective matrix factorization, we develop
computationally efficient techniques for learning the model parameters, even when the
number of entities and the number of observed relations are large. We empirically evaluate
our algorithm on an augmented collaborative filtering task.

Chapter  5:  We  extend  collective  matrix  factorization  to  a  hierarchical  Bayesian  model. 
We  discuss  the  limitations  of  the  maximum  a  posteriori  inference  used  in  the  previous 
chapter,  and  how  the  hierarchical  Bayesian  model  addresses  these  concerns.  We  show 
how  the  training  algorithm  for  collective  matrix  factorization  can  be  used  to  guide  exact 
Markov  Chain  Monte  Carlo  inference  in  the  hierarchical  Bayesian  variant  of  collective 
matrix factorization.

Chapter  6:  We  review  the  literature  on  statistical  relational  learning,  discussing  how  this 
thesis  relates  to  existing  research  in  relational  learning. 

Chapter  7:  We  summarize  the  main  contributions  of  this  thesis.  We  finally  conclude  with 
a  discussion  of  open  problems  and  directions  for  future  research. 

Chapter 2
Background 


In  this  chapter,  we  introduce  terminology  and  concepts  from  statistics  and  optimization 
that are relevant to this thesis. A large portion of our work is concerned with efficient
algorithms for maximum likelihood and Bayesian estimation in complex, high-dimensional
distributions. Section 2.2 is a brief refresher on maximum likelihood and Bayesian estimation.
Maximum likelihood estimation involves optimizing the parameters with respect to a
likelihood function, so we review the basics of optimization, including convexity, gradient
descent, and Newton's method, in Section 2.3. Collective Matrix Factorization (Chapter 4)
is  a  probabilistic  model  with  large  numbers  of  variables,  where  the  joint  distribution  can 
be  expressed  as  the  product  of  factors  over  smaller  subsets  of  variables.  Graphical  models, 
introduced  in  Section  2.4,  are  a  useful  formalism  for  expressing  factorizations  of  complex 
distributions.  In  Chapter  5,  we  propose  a  hierarchical  Bayesian  extension  to  Collective 
Matrix  Factorization,  where  Markov  Chain  Monte  Carlo  allows  us  to  train  the  model. 
Markov  Chain  Monte  Carlo  is  reviewed  in  Section  2.5. 


2.1  Basic  Terminology 

We introduce some basic notation and terminology used throughout this thesis:

> Vectors and scalars are denoted by lower-case Roman or Greek letters: $x$, $y$, $\theta$. The
elements of a length $n$ vector are denoted $x = (x_1, x_2, \ldots, x_n)$. Vectors are assumed to
be row-vectors, contrary to the usual convention.

> Matrices are denoted by capital Roman or Greek letters: $X$, $Y$, $\Theta$. Rows and columns
of a matrix are denoted by subscript notation: $X_{i\cdot}$, $Y_{\cdot j}$. To emphasize the size of a
matrix we write $X_{m \times n}$, indicating the matrix has $m$ rows and $n$ columns. The $n \times n$
identity matrix is denoted $I_{n \times n}$.

> The set of real numbers is denoted $\mathbb{R}$. Real coordinate spaces over $n$ coordinates are
denoted $\mathbb{R}^n$. In the case of matrices, real coordinate spaces on $mn$ dimensions are
denoted $\mathbb{R}^{m \times n}$.

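> A norm is denoted $\|\cdot\|$. The Euclidean norm on $\mathbb{R}^n$ is denoted $\|\cdot\|_2$:

\[ \|x\|_2 = \left( |x_1|^2 + \ldots + |x_n|^2 \right)^{1/2}. \]

The corresponding Euclidean norm on $\mathbb{R}^{m \times n}$, the Frobenius norm, is defined as

\[ \|X\|_{Fro} = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} X_{ij}^2 \right)^{1/2}. \]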


> The trace of an $m \times m$ square matrix $A$ is

\[ \mathrm{tr}(A) = \sum_{i=1}^{m} A_{ii}. \]

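> The inner product between vectors $x, y \in \mathbb{R}^n$ is denoted $\langle x, y \rangle$ where

\[ \langle x, y \rangle = x y^T = \sum_{i=1}^{n} x_i y_i. \]

The inner product between matrices $X, Y \in \mathbb{R}^{m \times n}$ is denoted $\langle X, Y \rangle$ where

\[ \langle X, Y \rangle = \mathrm{tr}(X^T Y) = \sum_{i=1}^{m} \sum_{j=1}^{n} X_{ij} Y_{ij}. \]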

A shorthand notation for the inner product is $x \circ y$ or $X \circ Y$. The element-wise
(Hadamard) product between matrices or vectors is denoted $x \odot y$, or $X \odot Y$, where the
arguments are of the same dimensions.

> The gradient of a function $f$ is denoted $\nabla f$. The gradient of a function with respect to a
subset of its variables, $x$, is denoted $\nabla_x f$. The Hessian of a function is likewise denoted
$\nabla^2 f$ or $\nabla_x^2 f$.
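
As an illustrative check of the notation above, the following Python sketch (standard library only) verifies numerically that the matrix inner product $\mathrm{tr}(X^T Y)$ equals the sum of the entries of the Hadamard product, and that the squared Frobenius norm is the inner product of a matrix with itself. The helper names and the example matrices are assumptions made for illustration; they do not come from the thesis.

```python
# Matrices are stored as lists of row vectors, matching the row-vector convention.

def inner(x, y):
    """Inner product <x, y> = x y^T = sum_i x_i y_i for row vectors."""
    return sum(a * b for a, b in zip(x, y))

def hadamard(X, Y):
    """Element-wise (Hadamard) product of two equally sized matrices."""
    return [[a * b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def transpose(X):
    return [list(col) for col in zip(*X)]

def matmul(A, B):
    Bt = transpose(B)
    return [[inner(row, col) for col in Bt] for row in A]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def matrix_inner(X, Y):
    """Matrix inner product <X, Y> = tr(X^T Y)."""
    return trace(matmul(transpose(X), Y))

def frobenius(X):
    """Frobenius norm, i.e. the square root of <X, X>."""
    return matrix_inner(X, X) ** 0.5

# Hypothetical 2 x 3 example matrices.
X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
Y = [[0.5, -1.0, 2.0], [1.0, 0.0, -2.0]]

via_trace = matrix_inner(X, Y)
via_sum = sum(sum(row) for row in hadamard(X, Y))
```

Both `via_trace` and `via_sum` compute the same double sum over $X_{ij} Y_{ij}$, which is why the two notations can be used interchangeably.
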

2.2 Probability and Statistics


In this section, we assume the reader has a basic understanding of probability: the concept
of a random variable, probability distributions and densities, expectations, conditional
probability, independence, and conditional independence.

We denote random variables, vectors, and matrices using the same notation as variables
(see Section 2.1). We disambiguate random variables from variables only when the
difference is unclear from the context. A draw from a random variable with probability
density function $\pi(\theta)$ is denoted

\[ x \sim \pi(\theta). \]

A joint distribution on random variables $\{x_1, x_2, \ldots, x_n\}$, with parameters $\theta$, is denoted

\[ p(x_1, x_2, \ldots, x_n; \theta). \]

The parameters of the distribution are sometimes referred to as the model. Data $\mathcal{D}$ consists
of a set of records, each of which contains a (possibly partial) assignment to the $n$ random
variables. The joint distribution also defines a likelihood function,

\[ \mathcal{L}(\theta) = p(\mathcal{D} \mid \theta), \]

which, for a fixed set of training data, assigns a score to each possible value of $\theta$.

Training (a.k.a. learning, parameter estimation) is the process of selecting a model, or
assigning scores to different models, given a fixed set of training data. Maximum likelihood
estimation chooses the parameters which maximize the likelihood function (an optimization
problem that we discuss further in Section 2.3):

\[ \theta_{MLE} = \arg\max_{\theta} \mathcal{L}(\theta). \tag{2.1} \]


Bayesian estimation is another approach to determining the behaviour of $\theta$, where the
goal is to determine the posterior distribution of the parameters given training data:

\[ p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}. \tag{2.2} \]


The prior distribution over parameters, $p(\theta)$, is an assumption made on how likely different
models are, prior to observing the data. The posterior distribution $p(\theta \mid \mathcal{D})$ is a combination
of prior belief with the likelihood, which is a function of the training data. Computing the
evidence,

\[ p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta, \tag{2.3} \]

requires computing an integral over a potentially large parameter space. In many cases,
the integral will not have a closed-form solution. We discuss techniques for estimating the
posterior distribution, by sampling from it, in Section 2.5.
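
To make the contrast between Equations 2.1 and 2.2 concrete, here is a minimal sketch assuming a Bernoulli likelihood with a Beta prior. This conjugate pair is chosen only because its posterior has a closed form; the data, the prior parameters, and the function names are hypothetical, and the models studied in this thesis generally do not admit such closed forms.

```python
def bernoulli_mle(data):
    """Maximum likelihood estimate of a coin's bias: the sample mean."""
    return sum(data) / len(data)

def beta_posterior(data, a=1.0, b=1.0):
    """Posterior Beta(a + heads, b + tails) under an assumed Beta(a, b) prior."""
    heads = sum(data)
    tails = len(data) - heads
    return a + heads, b + tails

# Hypothetical record of eight coin flips (1 = heads).
data = [1, 1, 0, 1, 0, 1, 1, 1]

theta_mle = bernoulli_mle(data)               # a single point estimate
a_post, b_post = beta_posterior(data)         # a full distribution over theta
posterior_mean = a_post / (a_post + b_post)   # one summary of that distribution
```

The maximum likelihood estimate is a single value of $\theta$, whereas the Bayesian result is a distribution; with a uniform prior the posterior mean is pulled slightly towards 1/2 relative to the sample mean.
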

2.3  Optimization 

Optimization of a continuous function plays a fundamental role in training statistical models
under the principle of maximum likelihood. This section briefly reviews the basics of
optimization that are relevant to this dissertation. For a thorough introduction to optimization,
see Boyd and Vandenberghe [13], Nocedal and Wright [90].1

2.3.1  Optimization  and  Maximum  Likelihood  Estimation 

Mathematical optimization (or just optimization) is at the heart of maximum likelihood
parameter estimation. Given variables $\theta \in \mathbb{R}^n$ and a function over these variables,
$f_0 : \mathbb{R}^n \to \mathbb{R}$, an optimization problem has the following form:

\[ \min_{\theta} \; f_0(\theta) \tag{2.4} \]
\[ \text{subject to} \; f_i(\theta) \le b_i, \quad i = 1 \ldots c. \tag{2.5} \]


The function $f_0$ is known as the objective. The functions $f_i : \mathbb{R}^n \to \mathbb{R}$ are inequality
constraints, and the constants $b_i$ are the bounds on the constraints. If one or more constraints
are defined, the problem is known as a constrained optimization; otherwise, it is
an unconstrained optimization.

Maximum likelihood estimation (Equation 2.1) is framed as a maximization problem,
which can be easily converted to a minimization problem: let $f_0(\theta) = -\log p(\mathcal{D} \mid \theta)$ be
the negative log-likelihood of the model. If only some values in $\mathbb{R}^n$ correspond to valid
parameters, encode those constraints using $f_i$.

1 We follow the notation of Boyd and Vandenberghe [13], but Nocedal and Wright [90] provides more
detailed coverage of optimization algorithms.

2.3.2 Convex Optimization


When the functions $f_0, f_1, \ldots, f_c$ are convex, Equations 2.4-2.5 are known as a convex
optimization. A function is convex if the following inequality holds for all $\theta_1, \theta_2 \in \mathbb{R}^n$
with $\alpha \ge 0$, $\beta \ge 0$, and $\alpha + \beta = 1$:

\[ f_i(\alpha \theta_1 + \beta \theta_2) \le \alpha f_i(\theta_1) + \beta f_i(\theta_2). \]

A large number of convex optimizations are additionally linear programs: each function
$f_0, f_1, \ldots, f_c$ satisfies

\[ f_i(\alpha \theta_1 + \beta \theta_2) = \alpha f_i(\theta_1) + \beta f_i(\theta_2). \]

Nonlinear and non-convex optimizations are those where the objective and/or constraints
violate the linearity and/or convexity conditions above.

An optimum of the objective $f_0$ is a value $\hat{\theta}$ which is feasible (i.e., satisfies the constraints)
and is the infimal value in an $\epsilon$-ball around it:

\[ f_0(\hat{\theta}) = \inf \{ f_0(\theta) \mid \forall i = 1 \ldots c,\; f_i(\theta) \le b_i,\; \|\theta - \hat{\theta}\|_2 \le \epsilon \}. \]

An optimum is local if there is another point $\theta_2$ that is an optimum and $f_0(\theta_2) < f_0(\hat{\theta})$. A
global optimum $\theta^*$ is one where there exists no other point $\theta_2$ such that $f_0(\theta_2) < f_0(\theta^*)$.
A desirable property of convex optimizations is that any local optimum is also a global
optimum. Often, in non-convex optimization, the best guarantee one can provide is convergence
to a local optimum.

2.3.3  Techniques  for  Unconstrained  Optimization 

The maximum likelihood estimation problems encountered in this thesis are most often
unconstrained optimizations, where the objective $f_0$ is differentiable. Therefore, a necessary
and sufficient condition for $\theta^*$ to be a stationary point is

\[ \nabla f_0(\theta^*) = 0. \tag{2.6} \]

In some cases there is an analytic solution to Equation 2.6. In most cases a solution is
found by an iterative algorithm that, starting at an initial value $\theta^{(0)}$, generates a sequence of
feasible iterates $\theta^{(1)}, \ldots, \theta^{(\infty)}$ such that $f_0(\theta^{(1)}), \ldots, f_0(\theta^{(\infty)})$ is a sequence that converges
to $f_0(\theta^*)$. If $f_0$ is convex, the iterative algorithm converges to a global optimum. However,
the same algorithms can be applied to non-convex optimizations, with the caveat that they
may (and typically will) converge to a local optimum.

Gradient descent (a.k.a. method of steepest descent) is a common iterative approach to
solving Equation 2.6:

\[ \theta^{(t+1)} = \theta^{(t)} - \eta \cdot \underbrace{\nabla f_0(\theta^{(t)})}_{-s^{(t)}}. \]

The stopping criterion checks whether $\|\nabla f_0(\theta^{(t+1)})\|_2 \le \epsilon_0$ for a small positive value of $\epsilon_0$.
The step length $\eta \in [0, 1]$ is chosen to guarantee sufficient decrease of the objective. In the
case of gradient descent, the step $s^{(t)}$ is the negative gradient.

A common condition for testing whether a step in the direction of $s^{(t)}$ with length $\eta$
leads to sufficient decrease is the Armijo condition:

\[ f_0\left(\theta^{(t)} + \eta \cdot s^{(t)}\right) - f_0\left(\theta^{(t)}\right) \le c_1 \eta \left\langle \nabla f_0(\theta^{(t)}), s^{(t)} \right\rangle \tag{2.7} \]

for some constant $c_1 \in (0, 1)$. A common approach to finding a good step length involves
starting with $\eta = 1$, and then testing step lengths $\{\beta\eta, \beta^2\eta, \beta^3\eta, \ldots\}$, for $\beta \in (0, 1)$, until
sufficient decrease is confirmed, or the step length is deemed insufficiently large.
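
The gradient descent update and the backtracking search for a step length satisfying the Armijo condition can be sketched as follows. This is a minimal illustration rather than the implementation used in this thesis: the quadratic objective, the constants `c1` and `beta`, the tolerance `eps0`, and all function names are assumptions chosen for the example.

```python
def f0(theta):
    # Hypothetical convex objective with its minimum at (1, -2).
    x, y = theta
    return (x - 1.0) ** 2 + 10.0 * (y + 2.0) ** 2

def grad_f0(theta):
    x, y = theta
    return [2.0 * (x - 1.0), 20.0 * (y + 2.0)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def backtracking(f, theta, g, s, c1=1e-4, beta=0.5, min_eta=1e-12):
    """Shrink eta from 1 by factors of beta until the Armijo condition holds."""
    eta = 1.0
    f_theta = f(theta)
    slope = dot(g, s)  # <grad f0, s>, negative for a descent direction
    while eta > min_eta:
        candidate = [t + eta * si for t, si in zip(theta, s)]
        if f(candidate) - f_theta <= c1 * eta * slope:
            return eta, candidate
        eta *= beta
    return 0.0, theta  # step length deemed insufficiently large

def gradient_descent(f, grad, theta0, eps0=1e-6, max_iter=10000):
    theta = list(theta0)
    for t in range(max_iter):
        g = grad(theta)
        if dot(g, g) ** 0.5 <= eps0:   # stopping criterion on the gradient norm
            return theta, t
        s = [-gi for gi in g]           # the step is the negative gradient
        eta, theta = backtracking(f, theta, g, s)
        if eta == 0.0:                  # no acceptable step length was found
            return theta, t
    return theta, max_iter

theta_hat, iterations = gradient_descent(f0, grad_f0, [0.0, 0.0])
```

Because the objective is ten times more curved in one coordinate than the other, many candidate step lengths are rejected before the Armijo test passes, which is typical of gradient descent on poorly conditioned problems.
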

Newton-Raphson (a.k.a. Newton's Method) is an iterative approach to solving Equation 2.6
that takes the local curvature of $f_0$ at each iterate $\theta^{(t)}$ into account:

\[ \theta^{(t+1)} = \theta^{(t)} - \eta \cdot \underbrace{\nabla f_0(\theta^{(t)}) \left[ \nabla^2 f_0(\theta^{(t)}) \right]^{-1}}_{-s^{(t)}}. \tag{2.8} \]

As with gradient descent, the Armijo condition may be used to test for sufficient decrease
of the objective. We note that if $\theta \in \mathbb{R}^n$, the Hessian will be an $n \times n$ matrix. The cost of
the Newton-Raphson step is usually dominated by the $O(n^3)$ cost of inverting the Hessian.

While one iteration of Newton-Raphson is more expensive than one of gradient descent,
Newton-Raphson has a superior rate of convergence: fewer iterations are required
to converge to an optimum. A detailed comparison of rates of convergence is beyond the
scope of this section, but we can loosely characterize the rate of convergence for gradient
descent as linear: $\{f_0(\theta^{(t)})\}_t$ converges to its minimum at a linear rate.2 For Newton-Raphson,
there exists a sequence of iterates sufficiently close to $\theta^*$ such that $\{\|\nabla f_0(\theta^{(t)})\|\}_{t > t_0}$
converges quadratically to zero.3
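
For comparison with gradient descent, the following is a small sketch of the Newton-Raphson update in Equation 2.8 on a two-parameter problem, where the Hessian can be inverted in closed form. The objective, its derivatives, the starting point, and the tolerances are hypothetical choices for illustration; the row-vector convention of Section 2.1 is why the gradient multiplies the inverse Hessian from the left.

```python
import math

def f0(theta):
    # Hypothetical convex objective with its minimum at (0, 0).
    x, y = theta
    return math.exp(x) + math.exp(-x) + (y - x) ** 2

def grad_f0(theta):
    x, y = theta
    return [math.exp(x) - math.exp(-x) - 2.0 * (y - x), 2.0 * (y - x)]

def hess_f0(theta):
    x, _ = theta
    return [[math.exp(x) + math.exp(-x) + 2.0, -2.0], [-2.0, 2.0]]

def inverse_2x2(H):
    (a, b), (c, d) = H
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def newton_raphson(theta0, eps0=1e-8, c1=1e-4, beta=0.5, max_iter=50):
    theta = list(theta0)
    for t in range(max_iter):
        g = grad_f0(theta)
        if math.hypot(g[0], g[1]) <= eps0:
            return theta, t
        Hinv = inverse_2x2(hess_f0(theta))
        # Row-vector convention: s = -grad * H^{-1}.
        s = [-(g[0] * Hinv[0][j] + g[1] * Hinv[1][j]) for j in range(2)]
        slope = g[0] * s[0] + g[1] * s[1]
        f_theta = f0(theta)
        eta = 1.0
        candidate = [theta[0] + s[0], theta[1] + s[1]]
        # Backtrack until the Armijo condition holds (guarded against tiny steps).
        while eta > 1e-12 and f0(candidate) - f_theta > c1 * eta * slope:
            eta *= beta
            candidate = [theta[0] + eta * s[0], theta[1] + eta * s[1]]
        theta = candidate
    return theta, max_iter

theta_hat, iterations = newton_raphson([2.0, -1.0])
```

On this example the full step $\eta = 1$ is normally accepted, and the gradient norm shrinks very quickly once the iterates are near the optimum, in line with the quadratic rate described above.
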

2 A sequence $\{x^{(t)}\}$ that converges to $x^*$ has a linear rate of convergence in $t$ if there is a constant
$r \in (0, 1)$ such that $\frac{\|x^{(t+1)} - x^*\|}{\|x^{(t)} - x^*\|} \le r$. The same sequence has a Q-order of convergence $p$, for $p > 1$, if there
is a constant $M > 0$ such that $\frac{\|x^{(t+1)} - x^*\|}{\|x^{(t)} - x^*\|^p} \le M$ for all $t > t_0$. When $p = 2$ the Q-order of convergence
is commonly referred to as quadratic convergence [90, Section 2.2].

3 The rate of convergence characterisations assume that the line search is exact: $\eta$ is chosen to maximize
the decrease in the objective between iterations. Newton-Raphson requires additional conditions be true in

2.4 Probabilistic Graphical Models


Probabilistic graphical models are compact representations of a probability distribution
over random variables $x_1, \ldots, x_n$. The joint distribution of the random variables is factored
into the product of potential functions over subsets of the variables:

\[ p(x_1, \ldots, x_n; \theta) = \frac{1}{Z} \prod_{c \in C} \psi_c(x_c), \quad \text{where} \]
\[ Z = \sum_{x_1} \cdots \sum_{x_n} \prod_{c \in C} \psi_c(x_c), \]
\[ x_c \subseteq \{x_1, x_2, \ldots, x_n\}, \]
\[ \psi_c(x_c) \ge 0. \]

We refer to the subsets of variables, $C$, as cliques. The cliques are groupings of variables
over which the potentials are defined, and are related to the independence structure of the
graphical model. If a variable is continuous, then replace the corresponding summation
with integration. The parameters $\theta$ are the set of parameters used to define the potential
functions $\psi_c$.

2.4.1  Directed  Graphical  Models 

In a directed graphical model, each potential function is the probability distribution of a
variable given its parents:

\[ p(x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} \underbrace{p(x_i \mid x_{\pi_i}; \theta_i)}_{\psi_c(x_c)}. \]

The parents of variable $x_i$ are denoted $x_{\pi_i}$. The parameters of the conditional probability
distribution, which determines the distribution of $x_i$ given the value of $x_{\pi_i}$, are denoted $\theta_i$.
If we draw a graph where each $x_i$ corresponds to a node, and each directed edge from
$x_j \in x_{\pi_i}$ to $x_i$ corresponds to a dependence, then the resulting graph must be acyclic for
the model to be directed. An example of a directed acyclic graph (DAG) on five variables
is presented in Figure 2.1. The parameters of the graphical model are $\theta = \{\theta_1, \ldots, \theta_n\}$.

Figure 2.1: A directed acyclic graph on random variables $\{x_1, \ldots, x_n\}$.

a neighbourhood around $\theta^*$, such as Lipschitz continuity of the Hessian. We refer the reader to [90, Section
3.3] for details.

A useful property of directed graphical models is that the product of the conditional
probability distributions (CPDs) is the joint distribution, and so the normalizing term $Z = 1$.
Since the log-normalizer does not involve a high-dimensional integration, maximum
likelihood estimation is straightforward if all the variables are observed:


\[ f_0(\theta) = -\log p(x_1, \ldots, x_n; \theta) = -\sum_{i=1}^{n} \log p(x_i \mid x_{\pi_i}; \theta_i). \tag{2.9} \]

To minimize $f_0$, we can optimize independently over each term in the summation in Equation 2.9.
The optimization is further simplified by assuming that, for each assignment to
$x_{\pi_i}$, $p(x_i \mid x_{\pi_i}; \theta_i)$ is an exponential family distribution (Definition 2) which depends only
on the value of $x_{\pi_i}$ and $\theta_i$.
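
The decomposition in Equation 2.9 can be illustrated with a minimal sketch that assumes discrete variables with tabular CPDs, for which the maximum likelihood estimate of each CPD is a table of normalized counts. The two-variable graph, the records, and the names `parents` and `fit_cpd` are hypothetical and serve only to show that each CPD is fitted on its own.

```python
from collections import Counter

# Hypothetical DAG on two binary variables: rain -> wet.
parents = {"rain": (), "wet": ("rain",)}

# Fully observed records (every variable is assigned in every record).
records = [
    {"rain": 1, "wet": 1},
    {"rain": 1, "wet": 1},
    {"rain": 1, "wet": 0},
    {"rain": 0, "wet": 0},
    {"rain": 0, "wet": 0},
    {"rain": 0, "wet": 1},
    {"rain": 0, "wet": 0},
    {"rain": 0, "wet": 0},
]

def fit_cpd(var, records):
    """MLE of a tabular CPD p(var | parents): counts normalized per parent assignment."""
    joint = Counter()
    marginal = Counter()
    for r in records:
        pa = tuple(r[p] for p in parents[var])
        joint[(pa, r[var])] += 1
        marginal[pa] += 1
    return {key: n / marginal[key[0]] for key, n in joint.items()}

# Each term of the summation is optimized independently of the others.
cpds = {var: fit_cpd(var, records) for var in parents}
```

Here `cpds["wet"][((1,), 1)]` is the estimated probability of `wet = 1` given `rain = 1`, computed without reference to the CPD of `rain`.
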


2.4.2  Undirected  Graphical  Models 

Given an arbitrary joint distribution there are often many ways to factor it into the normalized
product of potentials. In some cases we are fortunate, and the factorization can
be represented compactly using a directed graphical model. Undirected graphical models
do not require that the graph of dependencies between variables be acyclic. The most
common form of undirected graphical model uses a log-linear formulation:

\[ p(x_1, \ldots, x_n; \theta) = \frac{1}{Z} \exp\Big( \sum_{j} \theta_j f_j(x_1, \ldots, x_n) \Big), \]


where $f_j(\cdot) \in \{0, 1\}$ is a function of the state. While we do not use undirected graphical
models in this dissertation, the log-linear formulation has found use in other techniques
for relational learning: e.g., Markov Logic Networks [105], and Relational Markov Networks
[124].

2.4.3  Inference 

In visual representations of a graphical model, we shade nodes whose values are known,
i.e., evidence. The most fundamental use of a joint distribution involves computing the
probability of an event given (partial) evidence, i.e., inference. If no restrictions are
placed on the graphical model, then inference on a directed graphical model, framed as
a decision problem, is $\mathrm{NP}^{\mathrm{PP}}$-complete [93].4 In the case of Collective Matrix Factorization
(Chapter 4), inference is additionally complicated by the fact that one of the basic operations
in inference, joining potentials and eliminating variables, may involve computing
integrals with no analytic form.

2.4.4  Plate  Models 

Plate models are a language for defining repeated structure in directed graphical models
[18, 118]. When defining the directed acyclic graph among variables, a plate is a notational
short-hand for replication of a variable and the arcs that represent dependence. Four
examples of plate notation and the equivalent unrolled model are presented in Figure 2.2.

4 The complexity result requires finite bit-length of the inputs: i.e., discrete variables with tabular CPDs.

Figure 2.2: Examples of plate models and the corresponding unrolled network. (a) Expansion of a plate with no parents. (b) Expansion of a plate with shared parent.

Plates are usually denoted by square boxes around sets of variables. Each plate contains
one or more variables, and there is no requirement that a variable exist in a plate.
However, each variable in a plate is indexed by a subscript. An annotation to the plate
indicates how many times the subscripted variable is repeated. A plate is, in effect, a for-loop.
Like a for-loop, unrolling a plate consists of creating a variable for each iteration
of the plate annotation (e.g., $i = 1 \ldots n$). If a variable has multiple subscripts, then it
belongs to multiple plates: e.g., intersecting plates, Figure 2.2(d). We note that random
variables in a plate model or directed acyclic model may be scalar or a vector of random
variables, a notational convenience we use when defining the plate model for Collective
Matrix Factorization (Chapter 4).
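
The reading of a plate as a for-loop can be made literal with a short sketch. The function `unroll_plate`, its arguments, and the variable names `theta` and `x` are hypothetical; the example unrolls a single plate whose variable has one shared parent outside the plate.

```python
def unroll_plate(variables, edges, n, shared_parents=()):
    """Unroll one plate: copy each plate variable and its arcs for i = 1..n.

    variables: names of the variables inside the plate, e.g. ["x"].
    edges: arcs (parent, child) whose child is inside the plate; the parent is
           either another plate variable or a shared parent outside the plate.
    """
    nodes = list(shared_parents)
    arcs = []
    for i in range(1, n + 1):                  # the plate annotation i = 1..n
        rename = {v: f"{v}_{i}" for v in variables}
        nodes.extend(rename.values())
        for parent, child in edges:
            # Shared parents keep their name; plate variables get a subscript.
            arcs.append((rename.get(parent, parent), rename[child]))
    return nodes, arcs

nodes, arcs = unroll_plate(["x"], [("theta", "x")], n=3, shared_parents=["theta"])
```

Intersecting plates would correspond to nested loops, with one subscript per enclosing plate.
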


2.5  Bayesian  Learning 


In Section 2.2, we discussed the central role of the posterior distribution $p(\theta \mid \mathcal{D})$ in Bayesian
learning. If we wish to use the posterior for prediction, e.g., to predict the likelihood of
new test data, $\mathcal{D}_{new}$, we must average over the posterior distribution:

\[ p(\mathcal{D}_{new} \mid \mathcal{D}) = \int p(\mathcal{D}_{new} \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta. \tag{2.10} \]

The integration in Equation 2.10 is often intractable, and so we approximate it by the
following Monte Carlo estimate:

\[ p(\mathcal{D}_{new} \mid \mathcal{D}) \approx \frac{1}{S} \sum_{s=1}^{S} p(\mathcal{D}_{new} \mid \theta^{(s)}), \qquad \theta^{(s)} \sim p(\theta \mid \mathcal{D}). \]

The Monte Carlo estimate requires samples from the posterior distribution. The hard part
of Monte Carlo estimation is generating samples from a complex, high-dimensional distribution
that may not even have an analytic form. Metropolis-Hastings is a particularly
flexible approach for generating the required samples.
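
As a small illustration of averaging over posterior samples, the sketch below assumes a coin-flipping model whose posterior is a Beta distribution, so that posterior samples can be drawn directly and the exact answer is known for comparison. The posterior parameters, the sample count `S`, and the seed are hypothetical; in the models of interest here the samples would instead come from a Markov chain.

```python
import random

random.seed(0)

# Hypothetical posterior over a coin's bias theta: Beta(7, 3).
a_post, b_post = 7.0, 3.0
S = 20000

# Draw theta^(s) from the posterior. For a new flip, p(x_new = 1 | theta) = theta,
# so the Monte Carlo estimate is the average of the sampled theta values.
samples = [random.betavariate(a_post, b_post) for _ in range(S)]
mc_estimate = sum(samples) / S

# Closed form of the same integral, available here only because of conjugacy.
exact = a_post / (a_post + b_post)
```

The estimate approaches the exact value as `S` grows, at the usual Monte Carlo rate for independent samples.
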

2.5.1  Metropolis-Hastings 

One of the oldest techniques that falls under the rubric of Markov Chain Monte Carlo
(MCMC) is the Metropolis-Hastings (MH) algorithm. While it is often difficult to sample
directly from a particular target distribution $\pi(\theta)$, we may define a Markov chain whose
states are the possible values of $\theta$, and whose steady state distribution is $\pi(\theta)$.5 The following
presentation largely follows Chib and Greenberg [23].

We have assumed that $\theta \in \mathbb{R}^n$, and so a continuous state Markov chain over $\mathbb{R}^n$ must
be defined. Let $dy \subset \mathbb{R}^n$. We denote the probability of jumping from $\theta$ to the infinitesimal
region $dy$ as $P(\theta, dy)$.6 Suppose that we define the transition distribution $P(\cdot, \cdot)$ as


\[ P(\theta, dy) = p_{MH}(\theta, y)\, dy + r(\theta)\, \delta_{\theta}(dy), \tag{2.11} \]

\[ p_{MH}(\theta, \theta) = 0, \qquad p_{MH}(\theta, y) \ge 0. \tag{2.12} \]

5 Most commonly, $\pi(\theta)$ is the posterior distribution $p(\theta \mid \mathcal{D})$. However, Metropolis-Hastings can be used
to sample from any probability distribution.

6 $P(\theta, dy)$ is analogous to the transition matrix in a discrete-state Markov chain.

Informally, $p_{MH}(\theta, y)\, dy$ is a continuous density that describes the probability of the random
walk moving from $\theta$ to $dy$ where $\theta \notin dy$; $r(\theta)$ is the probability that the random
walk remains at its current state, $\theta$. Since it is possible that $r(\theta) \ne 0$, $p_{MH}(\theta, y)\, dy$ may
not integrate to 1. If the distribution $p_{MH}(\theta, y)$ satisfies a condition known as detailed
balance, then the continuous state Markov chain defined by $P(\theta, dy)$ has a unique steady
state distribution, which is $\pi(\theta)$. The detailed balance condition on $p_{MH}(\cdot, \cdot)$ is

\[ \pi(\theta)\, p_{MH}(\theta, y) = \pi(y)\, p_{MH}(y, \theta). \]

In the terminology of Markov chains, $p_{MH}(\theta, y)$ is the probability of transitioning from
state $\theta$ to state $y$.

We need a distribution $p_{MH}$ which exhibits detailed balance, and is easy to sample
from. Metropolis-Hastings gives us a procedure for creating an appropriate $p_{MH}$ given
a distribution over states, $q(\cdot, \cdot)$, which is easy to sample from, but which may not
exhibit detailed balance. The easy-to-sample distribution $q$ is often known as the proposal
distribution.

Let us assume that the proposal distribution, $q$, has the same support as the distribution
we wish to sample from. Next, define a correction function $\alpha(\cdot, \cdot) \in [0, 1]$, which we
choose to enforce detailed balance. In Metropolis-Hastings, $p_{MH}(\theta, y) = q(\theta, y)\, \alpha(\theta, y)$.
Therefore, the detailed balance condition is


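\[ \pi(\theta)\, p_{MH}(\theta, y) = \pi(y)\, p_{MH}(y, \theta) \tag{2.13} \]
\[ \pi(\theta)\, q(\theta, y)\, \alpha(\theta, y) = \pi(y)\, q(y, \theta)\, \alpha(y, \theta). \tag{2.14} \]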

Given detailed balance, what form should $\alpha(\cdot, \cdot)$ take? Assume that there is a violation
of detailed balance where the walk does not move frequently enough from $y$ to $\theta$, i.e.,
$\pi(\theta)\, q(\theta, y)\, \alpha(\theta, y) > \pi(y)\, q(y, \theta)\, \alpha(y, \theta)$. To correct this imbalance, it makes sense to pick
the largest possible value for $\alpha(y, \theta)$. Since $p_{MH}$ is a probability distribution, $\alpha(\cdot, \cdot) \in [0, 1]$,
and so the largest possible value of $\alpha(y, \theta)$ is 1. In the scenario we have described,
the detailed balance condition yields the form of $\alpha(\theta, y)$:

\[ \pi(\theta)\, q(\theta, y)\, \alpha(\theta, y) = \pi(y)\, q(y, \theta)\, \underbrace{\alpha(y, \theta)}_{=1}. \]

Notice that we do not even need to be able to evaluate $q$, just ratios of $q$: one need not
even know the normalizing constant for the proposal. Since $\alpha(\cdot, \cdot)$ must be a probability,
we define $\alpha(\theta, y)$ as

\[ \alpha(\theta, y) = \min\left\{ 1, \frac{\pi(y)\, q(y, \theta)}{\pi(\theta)\, q(\theta, y)} \right\}. \]

It is straightforward to check that this definition of $\alpha(\theta, y)$ satisfies detailed balance while
ensuring that $0 \le \alpha(\theta, y) \le 1$. If we start the walk at a state where $\pi(\theta) > 0$, and the
support of the proposal contains the support of $\pi(\cdot)$, then the only way $\pi(\theta)\, q(\theta, y) = 0$ is
if $\pi(\theta) = 0$. The probability of accepting a transition into a state where $\pi(\theta) = 0$ is zero,
and so $\alpha(\theta, y)$ can be set arbitrarily to 1. Distilling the discussion above, we can express
the correction function as

\[ \alpha(\theta, y) = \begin{cases} \min\left\{ 1, \dfrac{\pi(y)\, q(y, \theta)}{\pi(\theta)\, q(\theta, y)} \right\} & \text{if } \pi(\theta)\, q(\theta, y) > 0, \\ 1 & \text{otherwise.} \end{cases} \tag{2.15} \]


The correction function $\alpha(\cdot, \cdot)$ is also known as the acceptance probability of a transition
in the Markov chain. Given our choice of $p_{MH}$, we can interpret the Markov chain defined
by $P(\theta, dy)$ as follows: if the Markov chain is at state $\theta$, then we transition to a state drawn
from proposal $q(\theta, y)$ with probability $\alpha(\theta, y)$. Otherwise, the draw from the proposal is
rejected and the Markov chain remains at state $\theta$. Once the random-walk over states has
converged to its steady state, the probability of hitting state $\theta$ is $\pi(\theta)$.

Metropolis-Hastings simply involves simulating the random-walk defined by $P(\theta, dy)$
to generate approximate samples from $\pi(\theta)$. Pseudocode for simulating the random-walk
is provided in Algorithm 1. The output is a collection of samples $\{\theta^{(t)}\}_{t=1}^{T}$, which are not
guaranteed to be independent. Since unbiased Monte Carlo estimation requires independent
samples from $\pi(\theta)$, we filter the collection of samples to minimize serial correlation.
First, we throw away the first $t < t_0$ burn-in samples, where $t_0$ is chosen by the user.
Metropolis-Hastings is a random walk computation of the steady state of a Markov chain:
it takes time for the walk to converge to its steady state. Among the remaining samples,
where $t \ge t_0$, there can still be correlations between $\theta^{(t)}$ and $\theta^{(t+k)}$, especially if $k$ is small.
Subsampling simply throws away all but every $m$th sample, mitigating the effect of serial
correlation.7 After disposing of the burn-in samples, and subsampling, we are left with a
smaller set of samples, which we can use to produce estimates of functionals of $\pi(\theta)$.

7  Subsampling  mitigates  the  effect  of  serial  correlation,  but  increases  the  variance  of  Monte  Carlo 
estimates  and  increases  the  amount  of  sampling.  In  Markov  Chain  Monte  Carlo,  the  choice  of 
bias/variance/computation  tradeoff  is  left  to  the  user. 

Algorithm 1: Metropolis-Hastings Algorithm.

Input: Initial parameters $\theta^{(0)}$; Proposal distribution $p(\theta, y)$; Number of samples $T$.
Output: Collection of samples $\{\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(T)}\}$
for $t = 0 \ldots T - 1$ do
    Sample $u \sim U[0, 1]$.
    Sample proposed parameters $\theta^* \sim p(\theta^{(t)}, \cdot)$.
    if $u < \alpha(\theta^{(t)}, \theta^*)$ (see Equation 2.15) then $\theta^{(t+1)} = \theta^*$
    else $\theta^{(t+1)} = \theta^{(t)}$
end


Random  Walk  Metropolis-Hastings 

The design of a Metropolis-Hastings algorithm consists largely of the choice of $p(\theta, y)$,
the proposal distribution. In the case of Random Walk Metropolis-Hastings, the proposal
distribution is a multivariate Gaussian distribution:

\[ p(\theta, y) = \mathcal{N}(\theta, v \cdot I_{n \times n}), \]

i.e., the mean is the current state in the chain, and the covariance is spherical with variance
$v$, which must be chosen by the modeller. If $v$ is too small, then the Markov chain will take
a long time to mix; if $v$ is too large, then $\pi(\theta^*)/\pi(\theta)$ will be small, and so the probability
of accepting the proposal will be small. Essentially, we want $p(\theta^{(t)}, \cdot)$ to be a good
approximation of $\pi(\theta^{(t)})$. Balancing $v$ between these two concerns often requires substantial
human intervention. Moreover, a choice of $v$ that approximates the density around $\pi(\theta^{(0)})$
may not be a good approximation of the density around $\pi(\theta^{(t)})$, $t > 0$.8 A poor choice
of $v$ can force the modeller to accept that the underlying chain will be slow to mix (i.e.,
large $T$, slow training), or that the samples will not be a good representation of $\pi(\theta)$. In
Chapter 5, we propose a technique for automatically creating a proposal distribution that
dynamically approximates the posterior distribution in hierarchical Bayesian Collective
Matrix Factorization.
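
A minimal sketch of Random Walk Metropolis-Hastings for a one-dimensional target follows, including the burn-in and subsampling steps described earlier. The target density, the proposal variance `v`, the chain length, the burn-in length, and the subsampling interval are all hypothetical choices for illustration. Because the Gaussian proposal is symmetric, the proposal terms cancel in the acceptance ratio, and only an unnormalized target density is needed.

```python
import math
import random

def log_target(theta):
    """Unnormalized log density of a hypothetical target, N(3, 1)."""
    return -0.5 * (theta - 3.0) ** 2

def random_walk_mh(log_pi, theta0, v, T, rng):
    """Simulate T steps of the chain with a N(theta, v) proposal."""
    theta = theta0
    samples = []
    accepted = 0
    for _ in range(T):
        proposal = rng.gauss(theta, math.sqrt(v))
        # Symmetric proposal: the acceptance ratio reduces to pi(proposal) / pi(theta).
        log_alpha = min(0.0, log_pi(proposal) - log_pi(theta))
        u = 1.0 - rng.random()              # uniform on (0, 1], safe for log
        if math.log(u) < log_alpha:
            theta = proposal
            accepted += 1
        samples.append(theta)               # a rejected proposal repeats the state
    return samples, accepted / T

rng = random.Random(0)
samples, acceptance_rate = random_walk_mh(log_target, theta0=0.0, v=4.0, T=50000, rng=rng)

burn_in, thin = 1000, 10                    # discard burn-in, keep every 10th sample
kept = samples[burn_in::thin]
mean_estimate = sum(kept) / len(kept)
```

Rerunning the sketch with a much smaller or much larger `v` shows the trade-off discussed above: the acceptance rate moves towards one or towards zero, and in both cases the retained samples become more strongly correlated.
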


Block  Sampling 

If $\theta \in \mathbb{R}^n$ contains many parameters, then defining a good proposal distribution may be
difficult. We can partition the state into $n$ variables: $\theta = (\theta_1, \ldots, \theta_n)$. The Metropolis-Hastings
transition distribution, Equation 2.11, can be defined on each of the $n$ variables,
assuming that the others are fixed. It can be shown that cyclically sampling from
$p(\theta_i \mid \theta \setminus \theta_i)$, for each $i$, yields a transition in a Markov chain whose steady state
distribution is also $\pi(\theta)$. Even more generally, we can block $\theta$ into $B$ groups of variables,
sampling from the conditional distribution of each block given that the rest of the variables
are fixed. Blocking reduces the problem of defining a proposal distribution over all the parameters
into defining conditional proposal distributions over subsets of the parameters. It
is often substantially easier to define these conditional proposal distributions, and it may
be easier to sample from them as well.

8 Also, there may be no $v$ that approximates the density very well if it is highly non-spherical.


Gibbs  Sampling 


Gibbs Sampling is an example of a block Metropolis-Hastings sampler where the conditional
sampling distribution of variable $\theta_i$ (or block $i$) is $p(\theta_i \mid \theta \setminus \theta_i) = \pi(\theta_i \mid \theta \setminus \theta_i)$. It can
be shown that with this choice of proposal distribution, the acceptance rate $\alpha(\theta, \theta^*) = 1$.

The catch of Gibbs sampling is that we must be able to compute the conditional sampling
distribution. If $\pi(\theta)$ is a posterior distribution, then the conditional sampling
distribution will involve an integration problem similar to the one in Equation 2.2:

\[
p(\theta_i \mid \theta \setminus \theta_i, \mathcal{D})
  = \frac{p(\mathcal{D}, \theta \setminus \theta_i \mid \theta_i)\, p(\theta_i)}
         {p(\mathcal{D}, \theta \setminus \theta_i)}.
\]

If the prior and likelihood are conjugate, then it is straightforward to compute the
conditional sampling distributions.
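To make the role of conjugacy concrete, here is a minimal Gibbs sampler sketch in Python. The model is an assumption chosen for illustration and is not taken from the text: observations are Gaussian with unknown mean `mu` and precision `tau`, with a Gaussian prior on `mu` and a Gamma prior on `tau`, so both conditional sampling distributions are available in closed form. The function name `gibbs_normal` and the hyperparameter defaults are hypothetical.

```python
import random


def gibbs_normal(y, n_sweeps, m0=0.0, p0=1e-2, a0=1.0, b0=1.0, seed=0):
    """Gibbs sampler for y_i ~ Normal(mu, 1/tau) with priors
    mu ~ Normal(m0, 1/p0) and tau ~ Gamma(shape=a0, rate=b0)."""
    rng = random.Random(seed)
    n, sum_y = len(y), sum(y)
    mu, tau = sum_y / n, 1.0
    samples = []
    for _ in range(n_sweeps):
        # mu | tau, y is Gaussian: precisions add, means are precision-weighted.
        prec = p0 + n * tau
        mu = rng.gauss((p0 * m0 + tau * sum_y) / prec, prec ** -0.5)
        # tau | mu, y is Gamma with updated shape and rate.
        rate = b0 + 0.5 * sum((yi - mu) ** 2 for yi in y)
        # random.gammavariate takes (shape, scale), so pass scale = 1 / rate.
        tau = rng.gammavariate(a0 + 0.5 * n, 1.0 / rate)
        samples.append((mu, tau))
    return samples
```

Every draw is accepted, consistent with the acceptance rate of one noted above; the price is that each conditional must be derived by hand for the model at issue.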


2.5.2  Bayesian  Model  Averaging 

The posterior distribution $p(\theta \mid \mathcal{D})$ matters because of the role it plays in
Equation 2.10, the posterior predictive distribution. Why should we use the posterior predictive
distribution? The posterior predictive distribution is a form of Bayesian Model Averaging
[102, 103, 50, 130], whose rationale we discuss in this section.⁹

A substantial source of confusion surrounding Bayesian Model Averaging stems from
the name: What is a model? What quantities are being averaged? The word “model”
is so overloaded in practice that we prefer the term hypothesis. A hypothesis $\mathcal{H}_i$ is a
set of probability distributions indexed by parameters $\theta$; each distribution is uniquely
identified by a value of $\theta$. A composite hypothesis is a set with more than one probability
distribution; a simple hypothesis is a set with exactly one probability distribution. We
use the standard set notation $h \in \mathcal{H}$ to refer to a particular distribution in the
hypothesis; $\theta$ is used to denote a variable whose value identifies a particular distribution.
The word “model” in Bayesian Model Averaging can refer both to a composite hypothesis and to a
simple hypothesis. In this section, we consider the more general form of Bayesian Model
Averaging, where model refers to a composite hypothesis; later, we consider another form
of Bayesian Model Averaging, where model refers to the distributions within a chosen
hypothesis. In Chapter 5 we use the latter form of Bayesian Model Averaging, where
models correspond to parameters of a hypothesis.

⁹ Many of the ideas in Bayesian Model Averaging go back to the early 1960s [33]. We recommend
Hoeting et al. [50] and Wasserman [130] as tutorial references.

A simple example of the difference between hypotheses and parameters is described
in Wasserman [130]: consider a collection of independent flips of a coin, $\{Y_i\}_{i=1}^n$,
$Y_i \in \{0, 1\}$. One hypothesis, $\mathcal{H}_1$, may be that the coin flips are Bernoulli
distributed, $p(Y; \theta) = \theta^{Y}(1 - \theta)^{1-Y}$, where $\theta = 1/2$. Another hypothesis,
$\mathcal{H}_2$, may be that the coin flips are Bernoulli distributed where $\theta \neq 1/2$.
$\mathcal{H}_1$ is a simple hypothesis; $\mathcal{H}_2$ is a composite hypothesis. To this example
we could add other, more exotic, hypotheses: e.g., $\mathcal{H}_3$, that the
coin flips are discretizations of draws from a standard normal distribution.

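Given a finite collection of hypotheses $\mathcal{H}_1, \ldots, \mathcal{H}_K$,¹⁰ Bayesian Model
Averaging proposes that we weight our predictions on new data $\mathcal{D}_{\text{new}}$ under each
hypothesis $\mathcal{H}_k$ by the probability of that hypothesis given the training data, $\mathcal{D}$:

\[
p(\mathcal{D}_{\text{new}} \mid \mathcal{D})
  = \sum_{k=1}^{K} p(\mathcal{D}_{\text{new}} \mid \mathcal{H}_k, \mathcal{D})\, p(\mathcal{H}_k \mid \mathcal{D}).
  \tag{2.16}
\]

By Bayes’ rule,

\[
p(\mathcal{H}_k \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathcal{H}_k)\, p(\mathcal{H}_k).
\]

Every hypothesis $\mathcal{H}_k$ is indexed by parameters $\theta_k$, whose value is uncertain. In true
Bayesian fashion, we specify a prior over the parameters, $p(\theta_k \mid \mathcal{H}_k)$. The prior
over parameters allows us to integrate them out:

\[
p(\mathcal{D} \mid \mathcal{H}_k)
  = \int p(\mathcal{D} \mid \theta_k, \mathcal{H}_k)\, p(\theta_k \mid \mathcal{H}_k)\, d\theta_k.
  \tag{2.17}
\]
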
We can view $p(\mathcal{D} \mid \mathcal{H}_k)$ as a hypothesis-dependent mapping from data sets to a
score, the probability of that data set. Figure 2.3 illustrates this evidence mapping for two
different hypotheses. Given two hypotheses under which the training data is the most probable
of all possible data sets, we prefer the one that fits fewer data sets well, i.e., the less
flexible hypothesis. Typically the more flexible hypothesis is one which contains many
more distributions than the other. The larger hypothesis may fit the training data purely by
chance, leading to a scenario where the training error is low, but the test error is high [30].

Figure 2.3: An illustration of the evidence given two different hypotheses: $\mathcal{H}_1$ and
$\mathcal{H}_2$. The less flexible hypothesis, $\mathcal{H}_1$, models fewer data sets well than the
more flexible hypothesis $\mathcal{H}_2$. Two data sets, $\mathcal{D}_1$ and $\mathcal{D}_2$, are
marked on the x-axis.

¹⁰ While there is a finite set of hypotheses proposed by the modeller, each hypothesis may contain an
infinite set of distributions (cf. $\mathcal{H}_2$ in the coin flip example, above).
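The coin flip example can be worked through numerically. The sketch below is illustrative only and rests on two assumptions that are not in the text: under the composite hypothesis the parameter gets a uniform prior on [0, 1], and the two hypotheses are equally probable a priori. With a uniform prior the evidence integral of Equation 2.17 is a Beta function, so no numerical integration is needed. All function names are hypothetical.

```python
import math


def log_evidence_h1(n, s):
    """Simple hypothesis theta = 1/2: every sequence of n flips has probability 2^-n."""
    return n * math.log(0.5)


def log_evidence_h2(n, s):
    """Composite hypothesis with an assumed uniform prior on theta:
    the integral of theta^s (1 - theta)^(n - s) equals Beta(s + 1, n - s + 1)."""
    return math.lgamma(s + 1) + math.lgamma(n - s + 1) - math.lgamma(n + 2)


def bma_predict_heads(n, s, prior_h1=0.5):
    """Return (p(H1 | D), p(next flip is heads | D)) after s heads in n flips."""
    log_w1 = math.log(prior_h1) + log_evidence_h1(n, s)
    log_w2 = math.log(1.0 - prior_h1) + log_evidence_h2(n, s)
    top = max(log_w1, log_w2)  # subtract the max before exponentiating, for stability
    w1, w2 = math.exp(log_w1 - top), math.exp(log_w2 - top)
    post_h1 = w1 / (w1 + w2)
    pred_h1 = 0.5                  # prediction under the simple hypothesis
    pred_h2 = (s + 1) / (n + 2)    # posterior mean of theta under the uniform prior
    return post_h1, post_h1 * pred_h1 + (1.0 - post_h1) * pred_h2
```

With 5 heads in 10 flips both hypotheses can explain the data, and the posterior favours the less flexible one; with 10 heads in 10 flips the posterior shifts almost entirely to the composite hypothesis.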


Posterior  Predictive  Distributions  and  Bayesian  Model  Averaging 


Much of the ambiguity surrounding Bayesian Model Averaging stems from the fact that
there are two averaging operations in the previous section: over hypotheses and over
parameters. Colloquially, models can refer to both hypotheses and parameters. If we conflate
hypotheses and parameters, then Bayesian Model Averaging is equivalent to integrating
out the uncertainty over the parameters.¹¹

Let us restrict our consideration to a single composite hypothesis, $\mathcal{H}$, indexed by
parameters $\theta$. In this case, the posterior predictive distribution,

\[
p(\mathcal{D}_{\text{new}} \mid \mathcal{D}, \mathcal{H})
  = \int p(\mathcal{D}_{\text{new}} \mid \theta, \mathcal{H})\, p(\theta \mid \mathcal{D}, \mathcal{H})\, d\theta,
  \tag{2.18}
\]

closely resembles Equation 2.16, where the summation over hypotheses
$\mathcal{H}_1, \ldots, \mathcal{H}_K$ has been replaced by an integration over the parameters
$\theta$.¹² The most important difference between Equations 2.16 and 2.18 is that the former
allows for composite hypotheses, which can overlap and vary in size.

¹¹ Conflating hypotheses and parameters has the same effect as assuming that only one hypothesis is
possible in the prior.

To  distinguish  Equation  2.16  from  2.18  we  refer  to  the  latter  as  Hypothesis-Specific 
Bayesian  Model  Averaging.  In  Hypothesis-Specific  Bayesian  Model  Averaging,  the  mod¬ 
els  do  not  overlap,  and  they  have  the  same  degrees  of  freedom.  While  Hypothesis-Specific 
Bayesian Model Averaging can only make statements vis-à-vis a particular hypothesis, it
is not subject to many of the criticisms lobbed at Bayesian Model Averaging over multiple
composite hypotheses, which often stem from a poor choice of $\mathcal{H}_1, \ldots, \mathcal{H}_K$.

In  Collective  Matrix  Factorization,  a  composite  hypothesis  corresponds  to  the  structure 
of  a  graphical  (plate)  model.  The  parameters  of  the  graphical  model  index  the  hypothesis. 
In Chapter 5 we use Hypothesis-Specific Bayesian Model Averaging to integrate out the
posterior uncertainty in the parameters of Hierarchical Bayesian Collective Matrix
Factorization. We compare the Bayesian approach against the same model whose parameters are
selected by maximum a posteriori estimation. The difference between the two approaches is the
difference between model selection (where prediction is done using the maximum a posteriori
estimate) and soft model selection (where prediction is done using the posterior predictive
distribution).
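The contrast between the two kinds of prediction can be sketched on a toy problem. The example below assumes a Beta posterior over a Bernoulli parameter; this stand-in model is chosen only because its posterior is easy to sample, and it is not the model of Chapter 5. The maximum a posteriori prediction plugs in the posterior mode, while the posterior predictive distribution of Equation 2.18 is approximated by averaging the likelihood of the new data over posterior samples. All names are hypothetical.

```python
import random


def likelihood(theta, flips):
    """Probability of a sequence of 0/1 outcomes under Bernoulli(theta)."""
    p = 1.0
    for y in flips:
        p *= theta if y == 1 else 1.0 - theta
    return p


def map_prediction(a, b, flips_new):
    """Plug-in prediction at the mode of a Beta(a, b) posterior (requires a, b > 1)."""
    theta_map = (a - 1.0) / (a + b - 2.0)
    return likelihood(theta_map, flips_new)


def posterior_predictive(a, b, flips_new, n_samples=20000, seed=0):
    """Monte Carlo estimate of the integral of p(new | theta) p(theta | data) d theta."""
    rng = random.Random(seed)
    total = sum(likelihood(rng.betavariate(a, b), flips_new) for _ in range(n_samples))
    return total / n_samples
```

For a Beta(5, 2) posterior and two new heads, the plug-in prediction is 0.8² = 0.64, whereas the exact posterior predictive probability is 30/56 ≈ 0.536; the gap reflects the posterior uncertainty that the point estimate discards.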


Model  Averaging,  Model  Combination,  and  Model  Selection 

There are at least three procedures involving hypotheses and/or parameters that are often
confused for one another:

> Model Averaging: here we average over different hypotheses, i.e., Equation 2.16; or, we
average over different parameters in the same hypothesis, i.e., Equation 2.18.

> Model Combination: here we take different simple hypotheses $h_1, \ldots, h_r$, and combine
them into a new simple, compound, hypothesis $h_{1:r}$. In general $h_{1:r}$ need not be a
member of the composite hypotheses from which $h_1, \ldots, h_r$ were selected. The compound
hypothesis $h_{1:r}$ may be substantially more complex than any of the composite hypotheses
from which $h_1, \ldots, h_r$ were selected. Ensemble methods are a form of model
combination where $h_1, \ldots, h_r$ are the base learners and $h_{1:r}$ is the ensemble learner.
For example, consider the case where $h_1, \ldots, h_r$ are linear discriminants. The weighted
vote of linear discriminants, $h_{1:r}$, is not a linear discriminant.

> Model Selection: Given a collection of hypotheses, $\mathcal{H}_1, \ldots, \mathcal{H}_K$, choose
the most plausible hypothesis given training data. Model comparison is a special case of model
selection, where one must determine if there is a statistically significant difference between
two hypotheses (cf. hypothesis testing [65], Bayes factors [56]).¹³

¹² The equation for the posterior predictive distribution here is the same as Equation 2.10, except
that we have made the assumption of a hypothesis explicit in the notation.
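The claim that a weighted vote of linear discriminants is not itself a linear discriminant can be checked on a small hand-built case. The three discriminants and the test points below are assumptions picked for this illustration. The argument relies on one fact: the positive region of a linear discriminant is a half-space and therefore convex, so if two points are classified positive, their midpoint must be as well.

```python
def linear_discriminant(w, b):
    """Return h with h(x) = +1 if <w, x> + b > 0, else -1."""
    def h(x):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1 if score > 0 else -1
    return h


def weighted_vote(learners, weights):
    """Compound hypothesis: the sign of the weighted sum of the base predictions."""
    def h(x):
        total = sum(a * hk(x) for a, hk in zip(weights, learners))
        return 1 if total > 0 else -1
    return h


h1 = linear_discriminant((1.0, 0.0), 0.0)     # positive when x1 > 0
h2 = linear_discriminant((0.0, 1.0), 0.0)     # positive when x2 > 0
h3 = linear_discriminant((-1.0, -1.0), 1.0)   # positive when x1 + x2 < 1
ensemble = weighted_vote([h1, h2, h3], [1.0, 1.0, 1.0])

point_a, point_b = (3.0, -3.0), (-3.0, 3.0)
midpoint = (0.0, 0.0)
```

Here `ensemble` labels `point_a` and `point_b` positive but their midpoint negative, so its positive region is not convex and no single linear discriminant reproduces it.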

Linguistically,  the  phrase  “model  averaging”  suggests  a  combination  of  models,  perhaps 
by  averaging  them  together.  Ensemble  methods  like  bagging  have  been  explicitly  com¬ 
pared  to  Bayesian  Model  Averaging  [31].  However,  ensemble  methods  are  a  form  of 
model combination; Bayesian Model Averaging is soft model selection [81], that is, model
selection that takes the posterior uncertainty into account. To compare ensemble methods
to Bayesian Model Averaging is to draw a false equivalence between model combination
and model averaging: both techniques have their merits, but they start from very different
premises.¹⁴

Consider the posterior predictive distribution (Equation 2.18). If asymptotic consistency
holds within the assumed composite hypothesis $\mathcal{H}$, then
$p(\theta \mid \mathcal{D}, \mathcal{H})$ converges to a point mass as the number of observations,
the size of $\mathcal{D}$, tends towards infinity. If the posterior distribution over the
parameters is a point mass, then we have selected a single distribution. In Hypothesis-Specific
Bayesian Model Averaging, a point mass posterior is precisely model selection. Uncertainty in the
posterior over $\theta$ reflects the inability to distinguish between elements of $\mathcal{H}$.
If asymptotic consistency does not hold, or convergence occurs at a very slow rate relative to the
size of $\mathcal{D}$, then there can be a substantial advantage
to Hypothesis-Specific Bayesian Model Averaging (as we shall see in Chapter 5).¹⁵

In  contrast  to  Bayesian  Model  Averaging,  ensemble  methods  involve  combining  mul¬ 
tiple  base  learners,  even  as  the  training  data  set  grows  arbitrarily  large.  While  model 
combination  is  useful  (especially  when  the  base  learners  are  simple)  our  concern  in  this 
dissertation  is  with  model  averaging  and  model  selection. 


2.5.3  Alternatives  to  Markov  Chain  Monte  Carlo 

Ultimately Bayesian learning is the task of modeling the posterior distribution
$p(\theta \mid \mathcal{D})$. When the posterior lacks an analytic form, asymptotically exact
techniques must resort to sampling from the posterior. Markov Chain Monte Carlo is a popular
approach for sampling complex, high-dimensional distributions. There are, however, alternatives
that approximate the posterior distribution using simpler distributions, whose parameters can be
optimized to minimize the discrepancy between the approximation and the true posterior.
These techniques are known as variational methods. Variational Bayesian
Expectation-Maximization [7] is an example of a variational technique which could be applied as an
alternative to sample-based Bayesian inference. Unlike sample-based learning, variational
methods are asymptotically approximate: with limited computation, they often outperform
sample-based techniques; but even with infinite computation, they cannot represent every
possible posterior distribution.

¹³ Much of the complexity of both Frequentist and Bayesian hypothesis testing stems from the subtle
problem of defining “statistical significance”.

¹⁴ In theory, one could take advantage of ensemble methods by including the compound hypothesis formed
by the ensemble method into the set of composite hypotheses considered in Equation 2.16.

¹⁵ The same arguments can be applied to Equation 2.16, where the models are composite hypotheses, if
one assumes the hypotheses do not overlap.

In  light  of  the  existence  of  alternatives,  the  reader  may  question  why  we  insist  on 
sampling  in  Chapter  5.  When  variational  methods  work,  they  can  produce  excellent  results 
at relatively low computational cost. However, when a Bayesian model trained using
variational inference fails, it is difficult to tell whether the blame lies with the model or
with the approximate technique used to train it. In contrast, with sufficient computation, we can
generate enough samples from MCMC to yield an arbitrarily precise finite-sample approximation
of the posterior. This allows us to make statements about our model, not just about our
model combined with a particular approximate inference algorithm.

Chapter 3

Single  Matrix  Factorization 

3.1  Introduction 


The most ubiquitous form of relational data is a single matrix, which represents a mapping
from pairs of arguments to a value: Relation(x, y) $\to S \subset \mathbb{R}$. The rows of the matrix
index values of the first argument; columns index values of the second argument. The value
of a matrix entry at the row and column corresponding to entities x and y, respectively, is
the value of the relation. In first-order logic the values the mapping can take are true/false,
$S = \{0, 1\}$, but we allow for more general mappings: e.g., matrix entries may be real,
ordinal, integral, etc.

Matrix data shows up in many different domains. In topic modeling, the input is a
data matrix representing the relation Count(document, word) $\to \{0, 1, 2, \ldots\}$, which
measures how often a word occurs in a document. In recommendation systems the input
is a data matrix representing the relation Rating(user, item) $\to \{1, 2, \ldots, R\}$,
$R \in \mathbb{N}$, which measures the rating a user assigns to an item on an ordinal scale. In
citation analysis, the input is a data matrix representing the relation
Cites(citing, cited) $\to \{0, 1\}$, which measures whether a document cites another one.

The  most  common  tasks  involving  matrix  data  are  entity  clustering  and  prediction.  En¬ 
tity  clustering  involves  grouping  rows  together  by  similarity,  likewise  the  columns.  Pre¬ 
diction  involves  inferring  the  unseen  value  of  a  relation  given  observed  values.  The  unseen 
value  of  a  relation  may  involve  entities  that  appeared  in  the  training  data,  or  entities  that 
did  not  appear  in  the  training  data.  The  latter  scenario  allows  for  relational  models  that 
generalize  to  new  entities. 




An  approach  common  to  entity  clustering  and  prediction  is  low-rank  matrix  factoriza¬ 
tion: 

Definition 1. Low-rank matrix factorization is a statistical model that represents an $m \times n$
data matrix as a function of the product of two lower-rank factor matrices: an $m \times k$ matrix
$U$, and an $n \times k$ matrix $V$. That is, $X \approx f(UV^\top)$ for an element-wise transformation
$f : \mathbb{R} \to \mathbb{R}$ and $k < \min\{m, n\}$. The parameters of the model are $(U, V)$.

Over  the  years  a  number  of  low-rank  matrix  factorization  algorithms  have  been  pro¬ 
posed  for  applications  in  text  modeling,  image  analysis,  social  network  analysis,  bioin¬ 
formatics,  etc.  As  a  result,  we  have  a  panoply  of  models  and  no  clear  notion  of  what  the 
important  differences  are  between  them.  This  is  precisely  the  situation  where  a  statistical 
design  pattern  is  of  value.  Independent  of  any  particular  application,  we  want  a  generic 
view  of  the  modeling  choices  common  to  variants  of  low-rank  matrix  factorization.  This 
chapter  presents  a  statistical  design  pattern,  a  unified  view,  for  single  matrix  factorization. 
We  shall  show  that  the  differences  between  many  matrix  factorization  algorithms  can  be 
viewed  in  terms  of  a  small  number  of  modeling  choices.  In  particular,  our  unified  view 
places  dimensionality  reduction  methods,  such  as  singular  value  decomposition  [42],  into 
the  same  framework  as  matrix  co-clustering  algorithms  like  probabilistic  latent  semantic 
indexing  [52]. 

A  limitation  of  matrix  data  is  that  entities  can  participate  in  only  one  relation.  In 
Chapter  4  we  relax  this  restriction,  using  a  low-rank  matrix  factorization  as  a  building 
block  for  collective  matrix  factorization,  which  allows  entities  to  participate  in  more  than 
one  relation. 

Main  Contributions:  The  contributions  of  this  chapter  are  descriptive:  we  present  a  sta¬ 
tistical  design  pattern  for  low-rank  matrix  factorization  which  subsumes  a  wide  variety  of 
matrix  models  in  the  literature,  using  only  a  small  set  of  modeling  choices.  An  advantage 
of  the  statistical  design  pattern  view  is  that  it  allows  us  to  generalize  fast  training  proce¬ 
dures  to  many  different  matrix  factorization  models.  Another  advantage  is  that  our  unified 
view  reduces  the  problem  of  choosing  one  matrix  factorization  model  from  a  set  of  dozens 
of  possible  ones  to  what  we  believe  is  a  simpler  problem:  deciding  on  a  much  smaller 
number  of  modeling  choices. 

Relationship to Other Chapters: Low-rank matrix factorization is the building block for
collective matrix factorization, which extends factoring a single matrix to factoring sets of
related matrices, i.e., matrices which share dimensions. While we focus on a particular
kind of matrix factorization in later chapters, namely Exponential Family PCA [26], many
of the concepts are most easily explained in the context of single-matrix factorization.
This chapter focuses on the modeling aspects: a more detailed discussion of algorithms
for parameter estimation is deferred to Chapters 4 and 5.


3.2  Singular  Value  Decomposition 

The singular value decomposition (SVD) is the oldest matrix factorization model, tracing
its history as an approximation technique to 1907 [122]. SVD can be framed as a bilinear
form, where the data matrix $X$ is approximated by the product of low-rank factors $U$ and
$V$, chosen to minimize the squared-loss:

\[
\operatorname*{argmin}_{(U,V)\,:\,U^\top U = I,\ V^\top V = \Lambda}
  \sum_{i=1}^{n} \sum_{j=1}^{m} \left( X_{ij} - U_{i\cdot} V_{j\cdot}^\top \right)^2.
  \tag{3.1}
\]

The column-orthonormality constraint, $U^\top U = I$, and the orthogonality constraint
$V^\top V = \Lambda$ for diagonal $\Lambda$ guarantee a unique non-degenerate solution, up to a
permutation of the factors. Often, one drops the constraints, noting that any non-degenerate
solution $(U, V)$ can be orthogonalized by an invertible linear transform
$R \in \mathbb{R}^{k \times k}$: $(UR, VR^{-\top})$ has the same score as $(U, V)$, since
$(UR)(VR^{-\top})^\top = UV^\top$.
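A short NumPy sketch can illustrate both points: the constrained factors can be read off a truncated SVD, and the squared loss is unchanged under an invertible transform of the factors. The helper name `truncated_svd`, the random test matrix, and the particular transform `R` are assumptions made for this example. Note that the second factor is transformed by the inverse transpose of `R`, which is what leaves the product of the factors unchanged.

```python
import numpy as np


def truncated_svd(X, k):
    """Rank-k factors U (rows x k) and V (cols x k) minimizing ||X - U V^T||_F^2."""
    P, s, Qt = np.linalg.svd(X, full_matrices=False)
    U = P[:, :k]               # orthonormal columns: U^T U = I
    V = Qt[:k, :].T * s[:k]    # orthogonal columns: V^T V = diag(s[:k] ** 2)
    return U, V


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 6))
U, V = truncated_svd(X, k=2)
loss = np.sum((X - U @ V.T) ** 2)

# Any invertible R yields another pair of factors with the same product.
R = rng.normal(size=(2, 2)) + 3.0 * np.eye(2)
U_alt, V_alt = U @ R, V @ np.linalg.inv(R).T
loss_alt = np.sum((X - U_alt @ V_alt.T) ** 2)
```

The two losses agree up to floating-point error, and `loss` equals the sum of the squared singular values that were discarded.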

We  illustrate  the  major  design  choices  in  low-rank  matrix  factorization  by  first  consid¬ 
ering  variants  of  the  singular  value  decomposition: 

1. To evaluate Equation 3.1 we must know the value of all the entries of the data matrix
$X$. This precludes the use of SVD for predicting unseen entries of $X$, but we
can still use it for entity clustering (e.g., Latent Semantic Indexing [27]), or for
generalizing to new entities. Moreover, in many of the examples of data matrices
we considered, one observes only a small fraction of the entries of $X$: e.g.,
Rating(user, item) is usually only observed for a small fraction of all possible
(user, item) pairs. Data weights allow matrix factorizations to work around
these problems. Assume we are given $W_{ij} \ge 0$ for each $X_{ij}$. The loss for weighted
SVD [36, 119] is

\[
\operatorname*{argmin}_{(U,V)}
  \sum_{i=1}^{n} \sum_{j=1}^{m} W_{ij} \left( X_{ij} - U_{i\cdot} V_{j\cdot}^\top \right)^2.
  \tag{3.2}
\]

When $W_{ij} = 0$, the corresponding entry of $X$ makes no contribution to the objective,
and whatever value is assigned to $X_{ij}$ has no influence on the model. Setting data
weights to zero allows for missing values in the data.

2. For SVD, $X \approx UV^\top$, where the entries of $U$ and $V$ are arbitrary real numbers. If
$X_{ij}$ is a real number, then there is at least a prima facie case for this model. However,
if $X_{ij}$ contains integral counts, binary values, or ordinal scales, then the model may
predict a value which is impossible given the interpretation of a value of $X_{ij}$. For
example, if $X_{ij}$ are true/false (1/0), the model allows for predictions like
$\hat{X}_{ij} = 2$. One technique for enforcing constraints on values of a relation is to
introduce a prediction link, $f$, which transforms $UV^\top$ into a prediction in the data domain.
Typically, $f$ is a scalar function applied to each element of $UV^\top$. An example of $f$ is the
sigmoid function, which maps the real line onto the $(0, 1)$ interval.

3. In certain scenarios we wish to assign an interpretation to the low-rank factors. For
this purpose, we may impose hard constraints on the factors. A common example is
clustering or factor analysis, where each of the $k$ columns of $U$ (likewise $V$)
corresponds to a topic or factor, and the entries of a row of $U$, $(U_{i1}, \ldots, U_{ik})$,
correspond to the degree an entity is described by a particular topic or factor. In clustering
models, $\sum_{\ell=1}^{k} U_{i\ell} = 1 \wedge U_{i\ell} \ge 0$, which is a hard constraint on the
factor. Such hard constraints can dramatically increase the complexity of parameter estimation.

4. Even in the absence of hard constraints we must acknowledge that matrix factorizations
contain a large number of parameters: $k$ parameters for every entity being
modelled. The problem is exacerbated by missing values. Adding a regularization
penalty to the objective mitigates overfitting. For example, one may choose to add
$\ell_2$-regularization:

\[
\operatorname*{argmin}_{(U,V)}
  \sum_{i=1}^{n} \sum_{j=1}^{m} \left( X_{ij} - U_{i\cdot} V_{j\cdot}^\top \right)^2
  + \lambda_U \sum_{i=1}^{n} \sum_{\ell=1}^{k} U_{i\ell}^2
  + \lambda_V \sum_{j=1}^{m} \sum_{\ell=1}^{k} V_{j\ell}^2,
\]

where $\lambda_U > 0$, $\lambda_V > 0$ control the relative penalty for large parameters. Another
approach to regularization allows $k$ to be large, even larger than $\max\{m, n\}$, but
places a penalty on a continuous proxy for the rank of $UV^\top$, the trace-norm [120]:

\[
\operatorname*{argmin}_{(U,V)}
  \sum_{i=1}^{n} \sum_{j=1}^{m} \left( X_{ij} - U_{i\cdot} V_{j\cdot}^\top \right)^2
  + \lambda \cdot \operatorname{tr}(UV^\top).
\]

5. Another limitation of the singular value decomposition is the squared-loss objective
itself. Even if we allow for nonlinear prediction links, $\hat{X} = f(UV^\top)$, the squared
loss itself may be undesirable. The squared-loss is not robust to outliers: a single,
sufficiently large, outlier of $X$ can dominate the model. One approach is to minimize
under $\ell_1$ loss [57]:

\[
\operatorname*{argmin}_{(U,V)}
  \sum_{i=1}^{n} \sum_{j=1}^{m} \left| X_{ij} - U_{i\cdot} V_{j\cdot}^\top \right|.
\]

There are many reasons for supporting a variety of measures of reconstruction error,
or loss functions, between $X$ and $\hat{X} = f(UV^\top)$. We discuss the connection between
loss functions and prediction links in Section 3.4.

6. All of the techniques we have discussed are point estimators that fall under the rubric
of maximum likelihood or regularized maximum likelihood. A fully Bayesian
perspective requires modelling the posterior distribution of the model given the data,
$p(U, V \mid X)$. Because the majority of the literature is concerned with the
maximum likelihood case, we defer a discussion of Bayesian matrix factorization to
Section 3.10.

These  six  modeling  choices  are  not  mutually  exclusive;  the  variety  of  models  for  low-rank 
matrix  factorization  is  largely  due  to  researchers  exploring  the  Cartesian  product  of  these 
modeling  choices. 
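One way to see how these choices combine is to write a single objective function in which each choice is an argument. The sketch below is only illustrative: the function name `objective`, its parameter names, and the restriction to squared and absolute loss are assumptions, and hard constraints and Bayesian treatment are left out.

```python
import numpy as np


def sigmoid(t):
    """Element-wise prediction link mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))


def objective(X, U, V, W=None, link=lambda t: t, loss="squared", lam_u=0.0, lam_v=0.0):
    """Weighted reconstruction error between X and link(U V^T), plus l2 penalties.

    W     : data weights (defaults to all ones); zero entries mark missing values
    link  : element-wise prediction link f
    loss  : "squared" or "absolute"
    lam_* : l2 regularization weights on U and V
    """
    W = np.ones(X.shape) if W is None else W
    resid = X - link(U @ V.T)
    if loss == "squared":
        err = resid ** 2
    elif loss == "absolute":
        err = np.abs(resid)
    else:
        raise ValueError("unknown loss: " + loss)
    return np.sum(W * err) + lam_u * np.sum(U ** 2) + lam_v * np.sum(V ** 2)
```

With the defaults this reduces to the unconstrained squared loss of Equation 3.1; passing `W`, `link=sigmoid`, `loss="absolute"`, or nonzero penalties switches on the corresponding modeling choice independently of the others.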


3.3  Data  Weights 

The  great  advantage  of  the  singular  value  decomposition  is  the  existence  of  a  single  global 
optimum,  which  is  easily  computed  or  approximated  for  large  matrices  [42].  However, 
even  a  seemingly  innocuous  change  to  the  SVD  model  can  lead  to  a  significantly  more 
difficult  optimization.  Nowhere  is  this  more  apparent  than  in  weighted  SVD  [119],  a.k.a. 
criss-cross  regression  [36]. 

The  only  difference  between  SVD  and  Weighted  SVD  is  the  addition  of  data  weights: 
compare  Equations  3.1  and  3.2.  Adding  data  weights  significantly  complicates  training: 
the  objective  can  have  multiple  local  optima,  and  one  usually  resorts  to  general  purpose 
nonlinear  optimization.  Essentially  all  low-rank  matrix  factorizations,  save  SVD,  can  (and 
typically  do)  have  multiple  local  optima  in  their  objectives. 
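As an illustration of what training looks like once data weights are present, the following sketch fits the weighted objective of Equation 3.2 by alternating least squares, with a small l2 penalty added so that every subproblem is well posed. This is one common general-purpose approach, not necessarily the procedure used in the cited work; the function names, the penalty value, and the initialization are assumptions. Missing entries of `X` should hold any finite placeholder value and carry weight zero.

```python
import numpy as np


def weighted_loss(X, W, U, V, lam=0.0):
    """Weighted squared loss plus an l2 penalty on both factors."""
    return np.sum(W * (X - U @ V.T) ** 2) + lam * (np.sum(U ** 2) + np.sum(V ** 2))


def solve_rows(X, W, V, lam):
    """With V fixed, each row of U solves its own weighted ridge regression."""
    k = V.shape[1]
    U = np.empty((X.shape[0], k))
    for i in range(X.shape[0]):
        A = (V * W[i][:, None]).T @ V + lam * np.eye(k)   # V^T diag(W_i) V + lam I
        b = V.T @ (W[i] * X[i])                           # V^T diag(W_i) X_i
        U[i] = np.linalg.solve(A, b)
    return U


def weighted_svd_als(X, W, k, lam=1e-3, n_iters=50, seed=0):
    """Alternate exact updates of U and V; returns the factors and the loss history."""
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.1, size=(X.shape[1], k))
    history = []
    for _ in range(n_iters):
        U = solve_rows(X, W, V, lam)
        V = solve_rows(X.T, W.T, U, lam)
        history.append(weighted_loss(X, W, U, V, lam))
    return U, V, history
```

Each update minimizes the objective exactly over one factor, so the loss never increases, but the procedure only reaches a local optimum that depends on the initialization.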


3.4  Prediction  Links  and  Matching  Losses 

The singular value decomposition assumes that each entry of the data matrix is Gaussian,
where the corresponding entry in $\Theta = UV^\top$ is the natural parameter:
$X_{ij} \sim \mathrm{Gaussian}(\Theta_{ij})$. In this section, we discuss the close relationship
between the choice of prediction link, the choice of loss function, and the distributional
assumptions made by low-rank matrix factorizations. We begin by noting that the Gaussian is an
exponential family of distributions:

Definition 2. A parametric family of distributions $\psi_F = \{p_F(x \mid \theta) : \theta\}$ is a
regular exponential family if each density $p_F$ can be expressed in the following canonical form:

\[
\log p_F(x \mid \theta) = \log p_0(x) + \langle \theta, x \rangle - F(\theta),
\]

where $\theta$ is the vector of natural parameters for the distribution, $x$ is the vector of minimal
sufficient statistics, and $F(\theta) = \log \int p_0(x) \exp(\langle \theta, x \rangle)\, dx$ is the
log-partition function. $p_0(x)$ is a base measure, independent of the parameters. A distribution
in $\psi_F$ is uniquely identified by its natural parameters.

The measure of error of a matrix factorization model is usually evaluated in the data
space, i.e., using data $X$ and estimates $\hat{X} = f(UV^\top)$. Characterizing matrix
factorizations as statistical models allows one to measure error in the parameter space, i.e., the
likelihood of data $X$ given model $\Theta = UV^\top$. In the following section we show that, when
$X_{ij} \sim \psi_F(\Theta_{ij} = U_{i\cdot} V_{j\cdot}^\top)$, measuring error in data space is
equivalent to measuring it in parameter space.

3.4.1  Bregman  Divergence 

Almost  every  loss  function  used  in  the  literature  on  low-rank  matrix  factorization  is  a  Breg¬ 
man  divergence.  In  this  section  we  introduce  Bregman  divergences,  and  their  relationship 
to  exponential  families. 

Definition 3 (Generalized Bregman Matrix Divergence [44, 45]). For a closed, proper,
convex function $F : \mathbb{R}^{m \times n} \to \mathbb{R}$ the generalized Bregman divergence
[44, 45] between matrices $\Theta$ and $X$ is

\[
D_F(\Theta \,\|\, X) = F(\Theta) + F^*(X) - X \circ \Theta,
\]

where $F^*$ is the convex conjugate,
$F^*(\mu) = \sup_{\Theta \in \operatorname{dom} F} \left[ \Theta \circ \mu - F(\Theta) \right]$.¹

Definition 4 (Generalized Weighted Bregman Divergence). For a closed, proper, convex
function $F : \mathbb{R} \to \mathbb{R}$ and constant weight matrix
$W \in \mathbb{R}_{+}^{m \times n}$ the generalized weighted Bregman divergence is

\[
D_F(\Theta \,\|\, X, W) = \sum_{ij} W_{ij} \left( F(\Theta_{ij}) + F^*(X_{ij}) - X_{ij} \Theta_{ij} \right).
\]

¹ We remind the reader that $A \circ B$ denotes the matrix inner product $\sum_{ij} A_{ij} B_{ij}$.

Definition 5 (Bregman Divergence [15, 20]). For a closed, proper, differentiable convex
function $F : \mathbb{R} \to \mathbb{R}$, where the gradient $\nabla F = f$, and $W_{ij} = 1$, the
Bregman divergence is

\[
D_{F^*}(X \,\|\, f(\Theta)) = \sum_{ij} \left[ F^*(X_{ij}) - F^*(f(\Theta_{ij}))
  - \nabla F^*(f(\Theta_{ij})) \left( X_{ij} - f(\Theta_{ij}) \right) \right].
\]

This definition is a special case of $D_F(\Theta \,\|\, X, W)$.

Each of the Bregman divergences defines a measure of dissimilarity, in some cases a
weighted measure, between the parameter matrix $\Theta$ and the data matrix $X$. Definition 3
is included for completeness; the divergences it defines (e.g., the von Neumann divergence
[61, 128]) are not commonly used in matrix factorization. Definition 5 is the
standard definition presented in the literature, which assumes that the generating function
$F$ is differentiable. Definition 4 is the one we use in this thesis: $F$ is allowed to be
non-differentiable, and weighted divergences are required for matrix factorizations that use
data weights.
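The weighted divergence of Definition 4 is simple to compute once a generating function and its convex conjugate are supplied. The sketch below is an illustration under assumptions: the Gaussian pair F(t) = t²/2, F*(x) = x²/2 and the Poisson pair F(t) = exp(t), F*(x) = x log x − x are standard choices supplied here for the example, and the names `weighted_bregman`, `GAUSSIAN`, `POISSON`, and `xlogx` are hypothetical.

```python
import numpy as np


def xlogx(x):
    """x * log(x) with the convention 0 * log(0) = 0."""
    safe = np.where(x > 0, x, 1.0)
    return np.where(x > 0, x * np.log(safe), 0.0)


# (F, F*) pairs: log-partition function and its convex conjugate.
GAUSSIAN = (lambda t: 0.5 * t ** 2, lambda x: 0.5 * x ** 2)
POISSON = (np.exp, lambda x: xlogx(x) - x)


def weighted_bregman(Theta, X, W, F, F_star):
    """D_F(Theta || X, W) = sum_ij W_ij * (F(Theta_ij) + F*(X_ij) - X_ij * Theta_ij)."""
    return np.sum(W * (F(Theta) + F_star(X) - X * Theta))
```

For the Gaussian pair the divergence is half the weighted squared error, which recovers the weighted SVD loss up to a constant factor; for the Poisson pair it is a weighted generalized KL divergence between the counts and exp(Theta).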

There  is  a  close  relationship  between  Bregman  divergences  and  regular  exponential 
families: 


\[
\log p_F(x \mid \theta) = \log p_0(x) + F^*(x) - D_{F^*}(x \,\|\, f(\theta)),
\]

where the prediction link $f(\theta) = \nabla F(\theta)$ is known as the matching link for $F$
[3, 5, 26, 35]. The log-partition function, which defines the exponential family $\psi_F$,
implicitly defines a prediction link, $f = \nabla F$, and a matching loss function,
$D_{F^*}(x \,\|\, f(\theta))$. In theory, one could pick the prediction link $f$ such that it does
not relate to the log-partition function, $F$. In practice, the link and the loss almost always
match, since this guarantees that the resulting loss is convex in $\Theta$. To emphasize the
importance of matching links and losses, our definitions of generalized Bregman divergence force
the link and the loss to match. Therefore, any one of the following modelling choices is
equivalent to deciding the other two:

1. Choice of regular exponential family, $X_{ij} \sim \psi_F(\Theta_{ij})$.

2. Choice of prediction link, $f = \nabla F$, where $\hat{X} = f(UV^\top)$.

3. Choice of matching loss, $D_F(\Theta \,\|\, X, W)$.

Conceptually,  it  is  easiest  to  decide  that  the  entries  of  the  data  matrix  are  drawn  from  a 
particular  exponential  family:  e.g.,  Gaussian,  Poisson,  Bernoulli.  The  prediction  link  and 
matching  loss  follow  automatically. 
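
As an illustrative check of how the family determines the link and the loss, the sketch below works through the Poisson case. It assumes the standard Poisson parameterization: log-partition $F(\theta) = e^\theta$, matching link $f(\theta) = e^\theta$, conjugate $F^*(x) = x \log x - x$, and a $\theta$-independent term $-\log x!$. The function names are ours, chosen for illustration only.

```python
import math

def poisson_log_pmf(x, theta):
    """log p(x | theta) for a Poisson with natural parameter theta = log(mean)."""
    return x * theta - math.exp(theta) - math.lgamma(x + 1)

def f_star(x):
    """Convex conjugate of the Poisson log-partition F(theta) = exp(theta)."""
    return x * math.log(x) - x if x > 0 else 0.0

def matching_loss(x, theta):
    """D_{F*}(x || f(theta)) with f = exp: the unnormalized KL divergence."""
    mu = math.exp(theta)
    first = x * math.log(x / mu) if x > 0 else 0.0
    return first - x + mu

def log_base(x):
    """The theta-independent term of the Poisson log-density: -log(x!)."""
    return -math.lgamma(x + 1)

# Compare the log-density with (base term) + F*(x) - (matching loss).
gaps = []
for x in [0, 1, 3, 7]:
    for theta in [-1.0, 0.0, 0.5, 2.0]:
        lhs = poisson_log_pmf(x, theta)
        rhs = log_base(x) + f_star(x) - matching_loss(x, theta)
        gaps.append(abs(lhs - rhs))
max_gap = max(gaps)
print(max_gap)
```

Under these assumptions the two sides agree up to floating-point error, so maximizing the Poisson likelihood in $\theta$ is the same as minimizing the matching loss.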
The relationship between matrix factorization and exponential families is made clear by viewing the data matrix as a collection of samples: $\{X_{11}, \ldots, X_{mn}\}$. Let $\Theta = UV^T$ be a matrix of natural parameters. If we assume that $X_{ij} \sim \psi_F(U_{i\cdot} V_{j\cdot}^T)$, then

$$(U^*, V^*) = \operatorname*{argmin}_{(U,V)} D_{F^*}(X \,\|\, f(\Theta)) \tag{3.3}$$

yields the maximum likelihood solution. Collective Matrix Factorization is built on matrix factorizations where the link and loss always match, and data weights are required. Therefore, we prefer framing the maximum likelihood optimization as

$$(U^*, V^*) = \operatorname*{argmin}_{(U,V)} D_F(X \,\|\, UV^T, W). \tag{3.4}$$

The generalization of the singular value decomposition to non-Gaussian data distributions, using Equation 3.3, is known as Exponential Family PCA (E-PCA) [26].


Relationship  to  Generalized  Linear  Models 

The relationship between SVD and E-PCA is analogous to the relationship between linear regression and generalized linear models (GLMs) [78]. In the regression case, the main difference between the two types of models is that linear regression assumes that the response is Gaussian; generalized linear models assume the response is modelled by a regular exponential family. Both regressions are linear models. Linear regression has a closed form maximum likelihood solution; iteratively-reweighted least squares yields the maximum likelihood solution for GLMs. In the matrix factorization case, the main difference between the SVD model and the E-PCA model is that the former assumes the matrix entries are Gaussian; the latter assumes the matrix entries are drawn from a regular exponential family. Both matrix factorizations are bilinear models. In both matrix factorizations, the underlying optimization is non-convex. However, in the case of SVD all local minima are global minima; in E-PCA the non-convex optimization typically has multiple local optima. While the singular value decomposition can be solved as an eigenvalue problem, even for large sparse matrices, the same cannot be said for E-PCA. Almost every variant of maximum likelihood low-rank matrix factorization, other than SVD, is trained using a general purpose nonlinear optimization (e.g., Expectation-Maximization, Gradient Descent, Newton's Method).
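
To make the Gaussian special case concrete, the following sketch (using NumPy; the data, the rank `k`, and the variable names are illustrative assumptions) builds rank-$k$ factors from a truncated SVD and checks that the squared loss equals the energy in the discarded singular values. The factors are scaled so that $U^T U = I$ and $V^T V$ is diagonal, which is one common convention.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 6))
k = 2

# Thin SVD: X = P @ diag(s) @ Qt
P, s, Qt = np.linalg.svd(X, full_matrices=False)
U = P[:, :k]                 # m x k, orthonormal columns
V = Qt[:k, :].T * s[:k]      # n x k, singular values folded into V

svd_loss = np.sum((X - U @ V.T) ** 2)
tail_energy = np.sum(s[k:] ** 2)

# An arbitrary rank-k factorization, for comparison under the same loss.
U_rand = rng.normal(size=(8, k))
V_rand = rng.normal(size=(6, k))
rand_loss = np.sum((X - U_rand @ V_rand.T) ** 2)

print(svd_loss, tail_energy, rand_loss)
```

No iterative optimizer is needed here, which is the property that does not carry over to E-PCA.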
3.5  Parameter  Estimation


To recap, all maximum likelihood matrix factorizations of $X \in \mathbb{R}^{m \times n}$ can be differentiated by the following choices:

1. Data weights $W \in \mathbb{R}_+^{m \times n}$.

2. Prediction link $f : \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$. The prediction link is usually an element-wise operator on its inputs.

3. Hard constraints on factors, $(U, V) \in \mathcal{C}$. When the constraints for each factor are independent of each other, we denote them $\mathcal{C}_U$ and $\mathcal{C}_V$.

4. Weighted divergence or loss between $X$ and $\hat{X} = f(UV^T)$: $\mathcal{D}(X \,\|\, \hat{X}, W) \ge 0$.

5. Regularization penalty, $\mathcal{R}(U, V) \ge 0$.

Learning the model $X \approx f(UV^T)$ reduces to the following optimization:

$$(U^*, V^*) = \operatorname*{argmin}_{(U,V) \in \mathcal{C}} \left\{ \mathcal{D}(X \,\|\, f(UV^T), W) + \mathcal{R}(U, V) \right\}. \tag{3.5}$$

The objective in Equation 3.5 is almost always non-convex in $(U, V)$. Moreover, with the exception of the singular value decomposition, there can be many local optima. The user is forced to resort to a variety of non-linear optimization techniques: gradient descent, conjugate gradients, and expectation-maximization. In certain special cases more esoteric solvers can be applied: non-negative matrix factorization uses a multiplicative update. Trace-norm regularization leads to a semidefinite program.

Alternating projections is an optimization technique we pay significant attention to in later chapters. Matrix factorization is (generalized) bilinear: the model $f(UV^T)$ is linear in $U$ when $V$ is fixed, and vice versa.² This leads to classes of block coordinate descent algorithms where the optimization cycles between updates of the factors:

$$U^{(t)} = \operatorname*{argmin}_{U \in \mathcal{C}(U)} \left\{ \mathcal{D}\left(X \,\|\, f\left(U (V^{(t-1)})^T\right), W\right) + \mathcal{R}\left(U, V^{(t-1)}\right) \right\}, \tag{3.6}$$

$$V^{(t)} = \operatorname*{argmin}_{V \in \mathcal{C}(V)} \left\{ \mathcal{D}\left(X \,\|\, f\left(U^{(t)} V^T\right), W\right) + \mathcal{R}\left(U^{(t)}, V\right) \right\}. \tag{3.7}$$

Each of Equation 3.6 and 3.7 is known as a projection. Choosing a good projection algorithm is critical to achieving fast convergence to a local optimum.

² The phrase bilinear refers to the form of the parameters. Even if the prediction link $f$ is a nonlinear function, $UV^T$ remains a bilinear form. Our use of "bilinear" is analogous to the use of "linear" in linear models, such as generalized linear models and wavelets, where linearity refers to the basis.
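
The alternation between the two projections can be sketched for the simplest instance: identity link, unweighted squared loss, no hard constraints, and an $\ell_2$ penalty. Under those assumptions (which are ours, as are the function names and the $1/2$ scaling of the objective), each projection has a closed-form ridge-regression solution, and the objective cannot increase from one sweep to the next.

```python
import numpy as np

def objective(X, U, V, lam):
    """Squared loss plus l2 penalty on both factors (both scaled by 1/2)."""
    loss = 0.5 * np.sum((X - U @ V.T) ** 2)
    penalty = 0.5 * lam * (np.sum(U ** 2) + np.sum(V ** 2))
    return loss + penalty

def als(X, k, lam=0.1, iters=30, seed=0):
    """Alternating projections for l2-regularized squared-loss factorization."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(n, k))
    eye = np.eye(k)
    history = []
    for _ in range(iters):
        # Projection over U with V fixed (the role of Equation 3.6).
        U = np.linalg.solve(V.T @ V + lam * eye, V.T @ X.T).T
        # Projection over V with U fixed (the role of Equation 3.7).
        V = np.linalg.solve(U.T @ U + lam * eye, U.T @ X).T
        history.append(objective(X, U, V, lam))
    return U, V, history

# Synthetic rank-3 data with a little noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 15))
X = X + 0.01 * rng.normal(size=(20, 15))
U, V, history = als(X, k=3)
print(history[0], history[-1])
```

For non-Gaussian losses the closed-form solve would be replaced by an iterative solver for each projection; the outer loop is unchanged.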
3.5.1  Structure  in  Alternating  Projections

Low-rank matrix factorization contains rich independence structure among the parameters, almost none of which is taken advantage of by general purpose solvers such as gradient descent. One example of structure in matrix factorization is decomposability of the loss function, namely the fact that the error between $X$ and $\hat{X} = f(UV^T)$ can be expressed as the sum of the errors between entries $X_{ij}$ and $\hat{X}_{ij}$. All of the losses in Table 3.1 are decomposable.

The advantage of decomposable losses in alternating projection is that when one of the factors $U$ or $V$ is fixed, the update over the other factor can be decomposed into parallel updates over each row of the factor. If we can evaluate the loss, the regularizer, and the hard constraints on rows of each factor, then the per-factor projections in Equations 3.6-3.7 can be reduced into parallel projections over the rows of each factor:

$$\forall i = 1 \ldots m : \quad U_{i\cdot}^{(t)} = \operatorname*{argmin}_{U_{i\cdot} \in \mathcal{C}(U_{i\cdot})} \left\{ \mathcal{D}\left(X_{i\cdot} \,\|\, f\left(U_{i\cdot} (V^{(t-1)})^T\right), W_{i\cdot}\right) + \mathcal{R}\left(U_{i\cdot}, V^{(t-1)}\right) \right\}, \tag{3.8}$$

$$\forall j = 1 \ldots n : \quad V_{j\cdot}^{(t)} = \operatorname*{argmin}_{V_{j\cdot} \in \mathcal{C}(V_{j\cdot})} \left\{ \mathcal{D}\left(X_{\cdot j} \,\|\, f\left(U^{(t)} V_{j\cdot}^T\right), W_{\cdot j}\right) + \mathcal{R}\left(U^{(t)}, V_{j\cdot}\right) \right\}. \tag{3.9}$$

These operations are parallel in the sense that the $\forall i$ and $\forall j$ statements can be evaluated as a parallel-for loop. In essence, we have broken down a projection over $mk$ (or $nk$) parameters into $m$ (or $n$) separate projections over $k$ parameters. For many choices of loss $\mathcal{D}$, constraints $\mathcal{C}$, and regularizer $\mathcal{R}$, these projections correspond to widely used linear models. For example, under unweighted squared-loss, with no hard constraints, and $\ell_2$ regularization, the per-row projections correspond to $\ell_2$-regularized linear regression where the fixed factor acts as the covariate. In many cases, Equations 3.8-3.9 are convex optimizations. $\ell_2$-regularized E-PCA is an example where the projections are convex. The per-row updates can be framed as single steps of Newton-Raphson, as in Gordon [45].
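
The reduction from one projection over $mk$ parameters to $m$ projections over $k$ parameters can be checked numerically in the squared-loss, $\ell_2$-regularized case. In this sketch (the sizes, `lam`, and variable names are illustrative assumptions) each row of $X$ is the response of its own ridge regression, with the fixed factor $V$ as the design matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k, lam = 5, 7, 3, 0.5
X = rng.normal(size=(m, n))
V = rng.normal(size=(n, k))   # the fixed factor acts as the covariates

A = V.T @ V + lam * np.eye(k)

# One projection over all m*k parameters of U at once.
U_full = np.linalg.solve(A, V.T @ X.T).T

# The same projection as m independent k-dimensional ridge regressions.
# The list comprehension could be replaced by a parallel-for loop.
U_rows = np.vstack([np.linalg.solve(A, V.T @ X[i]) for i in range(m)])

print(np.max(np.abs(U_full - U_rows)))
```

Because no row update reads another row's result, the per-row solves can run in any order or concurrently.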

3.5.2  Row-Column  Exchangeability 

The  reader  may  rightly  question  why  decomposable  losses  are  so  common  in  low-rank 
matrix  factorization.  Certainly,  there  are  computational  advantages  to  decomposability. 
However,  there  is  also  a  case  to  be  made  for  such  losses  independent  of  any  computational 
considerations.  The  following  presentation  is  based  on  [2,  51]. 

Let us view the data matrix $X_{m \times n}$ as a realization of an underlying matrix $Z_{m \times n}$ plus a noise matrix $E_{m \times n}$, where $Z = f(UV^T)$. $Z$ is the noise-free matrix underlying $X$, and its entries can be viewed as random variables. We say that the random variables $\{Z_{ij}\}$ are row-column exchangeable (or, Aldous exchangeable), if the likelihood of $Z$ is invariant to permutations of the rows and columns:

$$p(Z_{11} \ldots Z_{mn} \mid X_{11} \ldots X_{mn}) = p(Z_{\pi(1)\rho(1)} \ldots Z_{\pi(m)\rho(n)} \mid X_{\pi(1)\rho(1)} \ldots X_{\pi(m)\rho(n)}), \tag{3.10}$$

where $\pi$ is a permutation of the rows of $X$ and $Z$, and likewise, $\rho$ is an independent permutation of the columns of $X$ and $Z$. Colloquially, Aldous exchangeability means that the identity of a row (or column) contains no information about $Z$, and that the relative position between two rows (or two columns) contains no information about $Z$. Aldous exchangeability is an extension of de Finetti exchangeability to matrices of random variables, and like de Finetti exchangeability, it leads to a representation theorem:

Theorem 1 (Aldous' Theorem [1]). If $Z$ is row-column exchangeable, then there exists a function $g$ and independent uniformly distributed random variables $\mu$, $\{u_1, \ldots, u_m\}$, $\{v_1, \ldots, v_n\}$, and $\{\epsilon_{ij}\}$ such that

$$Z_{ij} \overset{d}{=} g(\mu, u_i, v_j, \epsilon_{ij}).$$


A consequence of Theorem 1 is that any statistical model of a row-column exchangeable matrix can be parameterized by a global effect $\mu$, a row effect $u_i$, a column effect $v_j$, and a dyadic effect $\epsilon_{ij}$. Moreover, the dyadic effects are independent of one another, which naturally leads to decomposable objectives for matrix factorization. It should be noted that Theorem 1 is descriptive, not prescriptive (just like de Finetti exchangeability): no particular form of $g$ is implied, and so this theorem provides no especial justification for the bilinear form $f(UV^T)$.

If we consider some of the applications of matrix factorization, row-column exchangeability is a reasonable assumption. In text modelling, each row of the matrix is a document, each column a word, and each entry a word frequency count. The relative position of two documents, or two words, in the matrix is arbitrary. If we permute the rows of $X$ and the rows of $U$ in the same manner, the likelihood of the model does not change. In collaborative filtering, the relative position of users and items in the data matrix is not informative. Row-column exchangeability is less plausible when a dimension corresponds to time, or a spatially-varying quantity. The modeler may choose to ignore non-exchangeability in the data matrix to reap the computational advantages of decomposability.
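
The invariance described above is easy to verify for a decomposable loss. The sketch below (squared loss with identity link, random data, and our own variable names, all assumed for illustration) permutes the rows of $X$ together with the rows of $U$, and the columns of $X$ together with the rows of $V$, then compares the loss before and after.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 6, 4, 2
X = rng.normal(size=(m, n))
U = rng.normal(size=(m, k))
V = rng.normal(size=(n, k))

def squared_loss(X, U, V):
    """Decomposable loss: a sum of per-entry errors."""
    return np.sum((X - U @ V.T) ** 2)

row_perm = rng.permutation(m)   # permutation of the rows
col_perm = rng.permutation(n)   # independent permutation of the columns

loss = squared_loss(X, U, V)
loss_permuted = squared_loss(X[row_perm][:, col_perm], U[row_perm], V[col_perm])
print(loss, loss_permuted)
```

The same check would fail for a model whose loss depends on neighbouring rows or columns, such as one with a temporal smoothness term.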
3.6  Regularization


The objective in Equation 3.4, like many low-rank matrix factorization losses, will overfit. The number of parameters grows with the number of entities, and the amount of data available for each entity can be bounded, e.g., when new rows can be added to the matrix, but the set of columns is fixed. As a result, there is the need for regularization. We typically use a separate regularizer for each factor, so $\mathcal{R}(U, V) = \mathcal{R}(U) + \mathcal{R}(V)$. The most common matrix regularizers are simple extensions of those used in regression, namely $\ell_p$ regularizers: $\mathcal{R}(U) \propto \lambda \sum_{ij} |u_{ij}|^p$, where $\lambda$ controls the strength of the regularizer. From a computational perspective, $\ell_p$ regularizers have the advantage of preserving decomposability of the objective. Up to the constant factor $\lambda$,

$$\mathcal{R}(U) = \sum_{ij} |u_{ij}|^p = \sum_i \sum_j |u_{ij}|^p = \sum_i \mathcal{R}(U_{i\cdot}).$$

In this thesis we use $\ell_2$-regularization: $\mathcal{R}(U) = \lambda \|U\|_{Fro}^2 / 2$. $\ell_2$ regularizers have the merits of being differentiable, decomposable, and easily related to a Gaussian prior in the Bayesian matrix factorization.
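
Both properties of the $\ell_2$ regularizer can be checked directly. The sketch below assumes a zero-mean Gaussian prior with variance $1/\lambda$ on each entry of $U$ (our choice of parameterization, made so that the constants line up) and verifies that the penalty splits over rows and that it matches the negative log-prior up to an additive constant.

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 0.7
U1 = rng.normal(size=(5, 3))
U2 = rng.normal(size=(5, 3))

def reg(U, lam):
    """R(U) = lam * ||U||_Fro^2 / 2; also works on a single row."""
    return lam * np.sum(U ** 2) / 2

def neg_log_gaussian_prior(U, lam):
    """-log density of iid N(0, 1/lam) entries (assumed prior)."""
    d = U.size
    return 0.5 * lam * np.sum(U ** 2) + 0.5 * d * np.log(2 * np.pi / lam)

# Decomposability: the penalty on U is the sum of penalties on its rows.
total = reg(U1, lam)
by_rows = sum(reg(U1[i], lam) for i in range(U1.shape[0]))

# Penalty and negative log-prior differ only by a constant, so their
# differences between any two parameter settings coincide.
diff_reg = reg(U1, lam) - reg(U2, lam)
diff_prior = neg_log_gaussian_prior(U1, lam) - neg_log_gaussian_prior(U2, lam)
print(total, by_rows, diff_reg, diff_prior)
```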

In contrast, consider a non-decomposable regularizer: the trace norm, $\mathrm{tr}(UV^T)$. Any optimization using the trace norm cannot be easily decomposed, since the trace norm depends on the product of $U$ and $V$. In fact, if we consider the difference between max-margin matrix factorization [120] and its fast analogue [104], the only difference in the models is that the fast analogue replaces trace-norm regularization with $\ell_2$-regularization, allowing a hard-to-solve semidefinite program to be replaced by a straightforward nonlinear optimization.

As with linear models, one may prefer to use a sparse $\ell_1$ regularizer, which can be reduced to an inequality constraint on each row of the factors: $\sum_j |U_{ij}| \le t$ for some bound $t$. Since the $\ell_1$-regularizer is decomposable, we can use Equations 3.8-3.9, which leads to an $\ell_1$-regularized regression problem. When $\mathcal{D}$ is a Bregman divergence, the per-row projections are simply $\ell_1$-regularized generalized linear models. Therefore, we can exploit a variety of approaches for $\ell_1$-regularized regression in matrix factorization. We refer the reader to Schmidt et al. [113] for a survey of techniques for $\ell_1$-regularized regression.


3.7  Constraints 

Inequality constraints turn the projection into a constrained optimization. $\ell_1$-regularization is one example of inequality constraints. Non-negative matrix factorization is another example of such constraints, where the commonly-used multiplicative updates guarantee that the solution is feasible at each iteration. Linear constraints can be used to place matrix co-clustering into the same statistical design pattern as matrix factorization. With no constraints on the factors, matrix factorization can be viewed as factor analysis: an increase in the influence of one latent variable (column of $U$ or $V$) does not require a decrease in the influence of the other latents. In clustering the stochastic constraints,

$$\forall i \; \sum_j U_{ij} = 1 \;\wedge\; \forall i \forall j \; U_{ij} \ge 0,$$

$$\forall j \; \sum_i V_{ji} = 1 \;\wedge\; \forall i \forall j \; V_{ji} \ge 0,$$

imply that each row (column) of the data matrix must belong to one of $k$ latent clusters, each corresponding to a column of $U$ ($V$), where $U_{i\cdot}$ ($V_{j\cdot}$) is a distribution over cluster membership for that entity. In matrix co-clustering, stochastic constraints are placed on both factors, since the goal is to simultaneously cluster both rows and columns.

3.7.1  Bias  Terms 

Aldous' theorem (Theorem 1) suggests that there may be an advantage to modeling per-row and per-column behaviour. For example, in collaborative filtering, bias terms can calibrate for a user's mean rating. A straightforward way to account for bias is to append an extra column of parameters to $U$ paired with a constant column in $V$: $\tilde{U} = [U \;\; U_{k+1}]$ and $\tilde{V} = [V \;\; 1]$. We do not regularize the bias. It is equally straightforward to allow for bias terms on both rows and columns: $\tilde{U} = [U \;\; U_{k+1} \;\; 1]$ and $\tilde{V} = [V \;\; 1 \;\; V_{k+1}]$, and so $\tilde{U}\tilde{V}^T = UV^T + U_{k+1} 1^T + 1 V_{k+1}^T$. Note that these are biases in the space of natural parameters, a special case being a margin in the hinge or logistic loss; e.g., the per-row (per-user, per-rating) margins in MMMF [104] are just row biases.
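
The augmented-factor construction can be verified in a few lines. In this sketch the sizes and the names `u_bias` and `v_bias` (standing for the appended columns $U_{k+1}$ and $V_{k+1}$) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, k = 4, 5, 2
U = rng.normal(size=(m, k))
V = rng.normal(size=(n, k))
u_bias = rng.normal(size=(m, 1))   # one bias per row of the data matrix
v_bias = rng.normal(size=(n, 1))   # one bias per column of the data matrix

# Pair each bias column with a constant column in the other factor.
U_aug = np.hstack([U, u_bias, np.ones((m, 1))])
V_aug = np.hstack([V, np.ones((n, 1)), v_bias])

theta_aug = U_aug @ V_aug.T
theta_sum = U @ V.T + u_bias @ np.ones((1, n)) + np.ones((m, 1)) @ v_bias.T
print(np.max(np.abs(theta_aug - theta_sum)))
```

An optimizer that leaves the constant columns fixed and skips the regularizer on the bias columns can then treat the augmented factors like any other pair of factors.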


3.8  Models  subsumed  by  the  unified  view 

The beauty of reducing maximum likelihood low-rank matrix factorization into a statistical design pattern is that we can reduce the complexity of choosing between many different models into five modeling choices. Table 3.1 lists a few common instances of low-rank matrix factorization, along with how they can be reduced into a small number of modeling choices. As we will see in Section 3.11, these dimensions of variability provide the basis for a statistical design pattern.
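Table 3.1: Single matrix factorization models. dom $X_{ij}$ describes the types of values allowed in the data matrix. Unweighted matrix factorizations are denoted $W_{ij} = 1$. If constraints or regularizers are not used, the entry is marked with an em-dash. $A \circ B$ denotes the matrix inner product: $\mathrm{tr}(A^T B)$. $A \odot B$ denotes the element-wise (Hadamard) product.

| Method | dom $X_{ij}$ | Link $f(\theta)$ | Loss $\mathcal{D}(X \,\Vert\, \hat{X} = f(\Theta), W)$ | $W_{ij}$ |
|---|---|---|---|---|
| SVD [42] | $\mathbb{R}$ | $\theta$ | $\lVert W \odot (X - \hat{X}) \rVert_{Fro}^2$ | $1$ |
| W-SVD [36, 119] | $\mathbb{R}$ | $\theta$ | $\lVert W \odot (X - \hat{X}) \rVert_{Fro}^2$ | $\ge 0$ |
| $k$-means [48] | $\mathbb{R}$ | $\theta$ | $\lVert W \odot (X - \hat{X}) \rVert_{Fro}^2$ | $1$ |
| $k$-medians | $\mathbb{R}$ | $\theta$ | $\sum_{ij} \lvert W_{ij} (X_{ij} - \hat{X}_{ij}) \rvert$ | $1$ |
| $\ell_1$-SVD [57] | $\mathbb{R}$ | $\theta$ | $\sum_{ij} \lvert W_{ij} (X_{ij} - \hat{X}_{ij}) \rvert$ | $\ge 0$ |
| pLSI [52] | $1 \circ X = 1$ | $\theta$ | $\sum_{ij} W_{ij} \left( X_{ij} \log \frac{X_{ij}}{\Theta_{ij}} \right)$ | $1$ |
| NMF [63] | $\mathbb{R}_+$ | $\theta$ | $\sum_{ij} W_{ij} \left( X_{ij} \log \frac{X_{ij}}{\Theta_{ij}} + \Theta_{ij} - X_{ij} \right)$ | $1$ |
| $\ell_2$-NMF [91, 63] | $\mathbb{R}_+$ | $\theta$ | $\lVert W \odot (X - \hat{X}) \rVert_{Fro}^2$ | $1$ |
| Logistic PCA [112] | $\{0, 1\}$ | $(1 + e^{-\theta})^{-1}$ | $\sum_{ij} W_{ij} \left( X_{ij} \log \frac{X_{ij}}{\hat{X}_{ij}} + (1 - X_{ij}) \log \frac{1 - X_{ij}}{1 - \hat{X}_{ij}} \right)$ | $1$ |
| E-PCA [26] | many | many | decomposable Bregman ($D_F$) | $1$ |
| G²L²M [45] | many | many | decomposable Bregman ($D_F$) | $1$ |
| MMMF [120] | $\{0, \ldots, R\}$ | min-loss | $\sum_{r=1}^{R-1} \sum_{ij : X_{ij} \ne 0} W_{ij} \cdot h(\Theta_{ij} - B_{ir})$ | $1$ |
| Fast-MMMF [104] | $\{0, \ldots, R\}$ | min-loss | $\sum_{r=1}^{R-1} \sum_{ij : X_{ij} \ne 0} W_{ij} \cdot h_\gamma(\Theta_{ij} - B_{ir})$ | $1$ |

| Method | Constraints $U$ | Constraints $V$ | Regularizer $\mathcal{R}(U, V)$ | Algorithm(s) |
|---|---|---|---|---|
| SVD | $U^T U = I$ (can be applied post-hoc) | $V^T V = \Lambda$ (can be applied post-hoc) | — | Gaussian Elimination, Power Method |
| W-SVD | — | — | — | Gradient, EM |
| $k$-means | — | $V^T V = I$, $V_{ij} \ge 0$ | — | EM |
| $k$-medians | — | $V^T V = I$, $V_{ij} \ge 0$ | — | Alternating |
| $\ell_1$-SVD | — | — | — | Alternating |
| pLSI | $1^T U 1 = 1$, $U_{ij} \ge 0$ | $1^T V = 1$, $V_{ij} \ge 0$ | — | EM |
| NMF | $U_{ij} \ge 0$ | $V_{ij} \ge 0$ | — | Multiplicative |
| $\ell_2$-NMF | $U_{ij} \ge 0$ | $V_{ij} \ge 0$ | — | Multiplicative, Alternating |
| Logistic PCA | — | — | — | EM |
| E-PCA | — | — | — | Alternating |
| G²L²M | | | decomposable Bregman, $D_F(\cdot \,\Vert\, U) + D_F(\cdot \,\Vert\, V)$ | Alternating (Subgradient, Newton) |
| MMMF | — | — | $\mathrm{tr}(UV^T)$ | Semidefinite Program |
| Fast-MMMF | — | — | $\frac{1}{2} \left( \lVert U \rVert_{Fro}^2 + \lVert V \rVert_{Fro}^2 \right)$ | Conjugate Gradient |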

In light of Equations 3.8-3.9, a great many matrix factorizations can be seen as extensions of linear models to bilinear forms:

• SVD is the matrix analogue of linear regression.

• E-PCA is the matrix analogue of generalized linear models.

• G²L²M is the matrix analogue of a regularized generalized linear model.

• MMMF is the matrix analogue of ordinal regression under hinge loss.

• Fast-MMMF is the matrix analogue of ordinal regression under a smooth alternative to hinge loss, $h_\gamma$ (e.g., a smoothed hinge loss, or logistic loss).

• $k$-medians is the matrix analogue of quantile regression [58], where the quantile of interest is the median.

• Robust SVD [69] is the matrix analogue of the LASSO [126].

The key difference between a regression/clustering algorithm and its matrix analogue is that the exchangeability (or strong iid) assumption is replaced by row-column exchangeability.

Another useful aspect of our unified view is that it is relatively easy to capture similarities between matrix factorizations. For example, the relationship between pLSI and NMF, first derived by [28], becomes straightforward when one compares the two algorithms in Table 3.1. The prediction link is the identity function in pLSI and NMF. The loss for pLSI is KL-divergence, but unnormalized KL-divergence for NMF. The two losses differ by an additive $\sum_{ij} (\Theta_{ij} - X_{ij})$ term. In NMF, if we constrain the entries of $X$ to sum to one, and likewise constrain the entries of $\Theta$ to sum to one, then NMF is the same as pLSI. Adding an orthogonality constraint to $\ell_2$-NMF yields a soft clustering variant of $k$-means.
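
The relationship between the two losses can be confirmed numerically. This sketch uses unit weights and random positive matrices (both assumptions made for illustration) and checks that the gap between the unnormalized and normalized KL-divergence is the sum of $\Theta_{ij} - X_{ij}$, which vanishes once both matrices are scaled to sum to one.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.random((3, 4)) + 0.1       # strictly positive data
Theta = rng.random((3, 4)) + 0.1   # strictly positive parameters

def kl(X, Theta):
    """KL-divergence loss (the pLSI row of Table 3.1, unit weights)."""
    return np.sum(X * np.log(X / Theta))

def unnormalized_kl(X, Theta):
    """Unnormalized KL-divergence loss (the NMF row, unit weights)."""
    return np.sum(X * np.log(X / Theta) + Theta - X)

gap = unnormalized_kl(X, Theta) - kl(X, Theta)

# Constrain both matrices to sum to one, as in pLSI.
X_n = X / X.sum()
Theta_n = Theta / Theta.sum()
gap_normalized = unnormalized_kl(X_n, Theta_n) - kl(X_n, Theta_n)
print(gap, gap_normalized)
```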

The careful application of constraints can be used to fit clustering algorithms into our framework: orthogonality of the column factors $V^T V = I$ along with integrality of $V_{ij}$ corresponds to hard clustering the columns of $X$, as at most one entry in $V_{i\cdot}$ can be non-zero. In the $k$-means algorithm, $U$ contains the cluster memberships, and $V$ is a normalized version of the cluster centroids [29]. The rank of the decomposition and the number of clusters in $k$-means is $k$. Alternating projections corresponds to the classical approach to clustering: updating $U$ corresponds to assigning points to clusters; updating $V$ corresponds to updating the cluster centroids.
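
The correspondence between alternating projections and classical clustering can be sketched as follows. This is a simplified illustration, not the formulation of [29]: the membership matrix `M` (one-hot rows, playing the role of the membership factor) and the unnormalized centroid matrix `C` are our own names, the rows of `X` are the points being clustered, and centroid normalization is omitted.

```python
import numpy as np

def kmeans_as_factorization(X, k, iters=10, seed=0):
    """Lloyd-style alternation written as the factorization X ~ M @ C.T."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # Initialize centroids (n x k) from k distinct data rows.
    C = X[rng.choice(m, size=k, replace=False)].T.copy()
    losses = []
    for _ in range(iters):
        # Membership update with centroids fixed: nearest centroid per row.
        d2 = ((X[:, None, :] - C.T[None, :, :]) ** 2).sum(axis=2)  # m x k
        M = np.eye(k)[d2.argmin(axis=1)]                           # one-hot
        # Centroid update with memberships fixed: per-cluster means.
        for c in range(k):
            members = M[:, c] == 1
            if members.any():          # leave an empty cluster unchanged
                C[:, c] = X[members].mean(axis=0)
        losses.append(np.sum((X - M @ C.T) ** 2))
    return M, C, losses

# Two well-separated blobs (illustrative data).
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0.0, 0.3, size=(10, 2)),
               rng.normal(5.0, 0.3, size=(10, 2))])
M, C, losses = kmeans_as_factorization(X, k=2)
print(losses[-1])
```

Each row of `M` has exactly one non-zero entry, so `M.T @ M` is diagonal, which is the hard-clustering counterpart of the orthogonality constraint discussed above.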
Max-margin matrix factorization is one of the more elaborate models: ordinal ratings $\{1, \ldots, R\}$³ are modeled using $R - 1$ parallel separating hyperplanes, corresponding to the binary decisions $X_{ij} \le 1$, $X_{ij} \le 2$, $X_{ij} \le 3$, ..., $X_{ij} \le R - 1$. The per-row bias term $B_{ir}$ allows the distance between hyperplanes to differ for each row. Since this technique was conceived for user-item matrices, the biases capture differences in each user. Predictions are made by choosing the value which minimizes the loss of the $R - 1$ decision boundaries, which yields a number in $\{1, \ldots, R\}$ instead of $\mathbb{R}$.


3.8.1  Extensions  of  the  Unified  View 

Our unified view assumes a bilinear form, $UV^T$, which is common in the literature on matrix factorization. However, there is little reason why we cannot extend our consideration of low-rank matrix factorization to multilinear forms: e.g., $UAV^T$, where $U$ and $V$ can have different numbers of factors, and $A$ is a transformation between low-rank subspaces. Moving from a bilinear to a multilinear form is essentially a change in the structure of the underlying probabilistic graphical model: the same five choices can be used to differentiate models that share a particular multilinear form.

Moving towards a multilinear form naturally leads one to consider tensor factorization: e.g., $UAV^T$ is a special case of Tucker decomposition [129] on a 2D-tensor, a matrix. Our five modeling choices can also be used to differentiate tensor factorizations, but the choices may be subtler for tensors than for matrices. For example, sparsity constraints on the bilinear form are a generalization of sparse regression regularizers; sparsity constraints in a multilinear form are often more complex (cf. CUR-decomposition [34]: for the multilinear form $UAV^T$, $U$ and $V$ are sparse matrices, but $A$ is dense).


3.9  Limitations  of  Maximum  Likelihood  Approach 

All the matrix factorizations we have focused on are estimated under (regularized) maximum likelihood. The standard Bayesian objections to maximum likelihood estimation apply. There are many plausible models, where plausibility is measured by a model's probability under $p(U, V \mid X, W)$. Predictions should be made by considering the posterior predictive distribution, the expectation of the quantity we wish to predict under $p(U, V \mid X, W)$. The posterior predictive distribution is an example of model averaging, where every model, weighted by its posterior probability, contributes to the prediction. Since maximum likelihood chooses a single point $(U^*, V^*)$, there is no model averaging, which leads to higher variance in prediction. Moreover, point estimators ignore uncertainty in the model parameters $(U, V)$.

³ Zeros in the matrix are considered missing values, and are assigned zero weight.

In the case of low-rank matrix factorization there is a more subtle argument against maximum likelihood solutions: they define a generative distribution for entries of a matrix, but not for rows or columns of a matrix. As a result, maximum likelihood matrix factorizations are statistically ill-defined when the goal is to predict on new rows or columns of the data matrix, which did not appear in the training data. This is particularly troubling when rows and columns of the matrix correspond to groundings of a relation: maximum likelihood matrix factorization can incorrectly define the probability of a new entity participating in the relationship. Following the conventions of the literature on collaborative filtering, we refer to this problem as strong generalization.

The theoretical limits of maximum likelihood matrix factorization for strong generalization were first pointed out by Welling et al. [131]. We present a detailed explanation of the argument of Welling et al. [131], applying it to matrix factorizations not considered in that paper. Our goal is to derive the form for common maximum likelihood factorizations from a statistical viewpoint, beginning with the likelihood of the data.

The most elementary premise we can make is that there is a row-column exchangeable data matrix $X$, the factors are $U$ and $V$, and there are no data weights. Any row-column exchangeable matrix can be viewed as a set of exchangeable rows $\{X_{i\cdot}\}_{i=1}^m$ or as a set of exchangeable columns $\{X_{\cdot j}\}_{j=1}^n$. We consider the case of exchangeable rows; the derivation for exchangeable columns is similar.

Below, we work through the derivation for the maximum likelihood method for predicting the distribution of a new row of $X$. As will be seen, this derivation contains a critical flaw, rendering predictions on this new row suspect. The flaw in the derivation may explain why matrix factorization models tend to have poor predictive power on strong generalization tasks. While there may exist other, less problematic, derivations for strong generalization on matrix models, none are currently known.

Let $\nu = \{V, \mu_V, \Sigma_V\}$ contain the factor $V$ and the parameters of the prior over $V$:

$$p(\nu) = \prod_{j=1}^{n} p(V_{j\cdot} \mid \mu_V, \Sigma_V),$$

where $(\mu_V, \Sigma_V)$ are the parameters of a $k$-dimensional Gaussian prior over each row of $V$. Let $\gamma = \{\mu_U, \Sigma_U\}$ contain the prior parameters of a similar $k$-dimensional Gaussian prior over each row of $U$. Let $\phi = \{\nu, \gamma\}$ be the set of parameters. Ultimately, we will optimize a function of $\phi$. Using the notation we have introduced:

$$p_\phi(X, U) = \prod_{i=1}^{m} p_\phi(X_{i\cdot}, U_{i\cdot}) = \prod_{i=1}^{m} p_\nu(X_{i\cdot} \mid U_{i\cdot}) \, p_\gamma(U_{i\cdot}).$$

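There is still a large gap between $p_\phi(X, U)$ and anything recognizable as a matrix factorization. The following variational EM derivation bridges the gap. Consider the log-likelihood of a row of the data matrix, $X_{i\cdot}$:

\begin{align}
\log p_\phi(X_{i\cdot}) &= \log \int p_\phi(X_{i\cdot}, U_{i\cdot}) \, dU_{i\cdot} \notag \\
&= \log \int_{U_{i\cdot}} q(U_{i\cdot} \mid X_{i\cdot}) \, \frac{p_\phi(X_{i\cdot}, U_{i\cdot})}{q(U_{i\cdot} \mid X_{i\cdot})} \notag \\
&\ge \int_{U_{i\cdot}} q(U_{i\cdot} \mid X_{i\cdot}) \log \frac{p_\phi(X_{i\cdot}, U_{i\cdot})}{q(U_{i\cdot} \mid X_{i\cdot})} \tag{3.11} \\
&= \underbrace{\int_{U_{i\cdot}} q(U_{i\cdot} \mid X_{i\cdot}) \log p_\phi(X_{i\cdot}, U_{i\cdot})}_{B(\phi)} - \underbrace{\int_{U_{i\cdot}} q(U_{i\cdot} \mid X_{i\cdot}) \log q(U_{i\cdot} \mid X_{i\cdot})}_{-H(q)}. \tag{3.12}
\end{align}

Equation 3.11 follows from Jensen's inequality. The value of the entropy of the variational distribution $H(q)$ is critical to this argument.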

The variational distribution $q$ is an approximation to the true posterior over latents, $p(U_{i\cdot} \mid X_{i\cdot})$. We are free to set $q$ to whatever we choose. For reasons which will become clear below, we choose $q$ to be

$$q(U_{i\cdot} \mid X_{i\cdot}) = \prod_{\ell=1}^{k} \delta(U_{i\ell} - \hat{U}_{i\ell}), \qquad \delta(x) = \begin{cases} 1, & x = 0, \\ 0, & \text{otherwise,} \end{cases} \tag{3.13}$$

where $\{\hat{U}_{i\ell}\}_{i,\ell}$ are variational parameters, which must be optimized to achieve a tight lower bound on the data likelihood $\log p_\phi(X_{i\cdot})$. We shall need to consider $U_{i\ell}$ under the assumption that it is discrete, and later that it is continuous. To simplify our equations we denote both summation and integration over $U$ using $\sum$. Likewise we use $\delta$ to denote both a 0-1 indicator and a Dirac delta. If $U_{i\ell}$ were discrete, then $H(q)$ would be zero. For now, let us pretend that $H(q)$ is also zero when $U_{i\ell}$ is continuous. Under our assumption that $H(q)$ is zero, $B(\phi)$ is a lower bound on the log-likelihood. Substituting 3.13 into $B(\phi)$:

\begin{align}
\log p_\phi(X_{i\cdot}) &\ge B(\phi) \notag \\
&= \sum_{U_{i\cdot}} \prod_{\ell=1}^{k} \delta(U_{i\ell} - \hat{U}_{i\ell}) \log p_\phi(X_{i\cdot}, U_{i\cdot}) \notag \\
&= \sum_{U_{i\cdot}} \prod_{\ell=1}^{k} \delta(U_{i\ell} - \hat{U}_{i\ell}) \log \left[ p_\nu(X_{i\cdot} \mid U_{i\cdot}) \, p_\gamma(U_{i\cdot}) \right] \notag \\
&= \log p_\nu(X_{i\cdot} \mid \hat{U}_{i\cdot}) + \log p_\gamma(\hat{U}_{i\cdot}). \tag{3.14}
\end{align}

Summing Equation 3.14 over all the rows of $X$ yields the following lower bound:

$$\log p(X) \ge \mathcal{O}(\phi, \hat{U}) = \sum_{i=1}^{m} \log p_\nu(X_{i\cdot} \mid \hat{U}_{i\cdot}) + \sum_{i=1}^{m} \log p_\gamma(\hat{U}_{i\cdot}). \tag{3.15}$$


The reason for our choice of $q$ becomes clear: optimizing the variational lower bound reduces to alternating optimization over the factors. Alternating optimization consists of optimizing $\mathcal{O}$ by coordinate descent: cyclically optimize over $\hat{U}$ (a.k.a. the variational E-step) and $\phi$ (a.k.a. the variational M-step) till convergence. The variational parameters $\hat{U}$ are a point estimate of the $U$ factor, and $\phi$ contains the point estimate of the $V$ factor. The first summation in $\mathcal{O}$ can be viewed as a loss function, with the second summation acting as a regularizer over the $U$ factor.

The defect in the derivation of $\mathcal{O}$ from $p(X)$ is that it assumes the entropy of the variational distribution $H(q) = 0$. While this is true when the latent variables $U$ are discrete, this is not true when the latent variables are continuous. It can be shown that if $U_{i\ell}$ is continuous and unbounded, or continuous with the stochastic constraint $\sum_\ell U_{i\ell} = 1 \wedge U_{i\ell} \ge 0$, then $H(q) = -\infty$. Many of the matrix factorizations in Table 3.1 are ones where $H(q) = -\infty$. When $H(q)$ is not finite, Equation 3.15 is false, and there is no reason to believe that optimizing the lower bound will yield a good low-rank representation for a new row of $X$.

What we have provided is an argument that indicates how maximum likelihood estimation can go wrong on folding-in. The severity of the problem depends on the matrix factorization model and the data. As we shall see in the next chapter, a Bayesian matrix factorization is never subject to the folding-in problem.
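
One way to see why the entropy of a point mass on a continuous variable is not zero is to treat the Dirac delta as the limit of ever-narrower Gaussians. This framing is our illustrative assumption, not the argument of Welling et al. [131]; the sketch evaluates the standard closed-form differential entropy of a univariate Gaussian as its standard deviation shrinks.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy, in nats, of a univariate N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

# Narrower and narrower approximations to a point mass.
sigmas = [1.0, 1e-2, 1e-4, 1e-8]
entropies = [gaussian_entropy(s) for s in sigmas]
for s, h in zip(sigmas, entropies):
    print(s, h)
```

The entropy decreases without bound as the distribution concentrates, so the $H(q)$ term in the lower bound cannot be dropped for continuous latent variables.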
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3.10  Bayesian  Matrix  Factorization 


The  discussion  thus  far  has  centered  on  regularized  maximum  likelihood  estimation,  which 
suffers  from  the  limitations  discussed  in  Section  3.9.  Bayesian  Matrix  Factorizations  do 
not  suffer  from  the  aforementioned  limitations.  In  this  section,  we  briefly  discuss  how  our 
unified  view  of  matrix  factorization  relates  to  Bayesian  models  of  matrix  data.  We  explore 
Bayesian  matrix  factorization  in  further  detail  in  Chapter  5. 

The  signature  difference  of  Bayesian  matrix  factorization  is  that  we  model  the  entire 
posterior  distribution  of  the  parameters: 


$$(U, V) \sim p(U, V \mid X, W), \tag{3.16}$$

$$p(U, V \mid X, W) = \frac{p(X \mid U, V, W)\, p(U, V \mid W)}{\int p(X \mid U, V, W)\, p(U, V \mid W)\, dU\, dV}. \tag{3.17}$$


The posterior distribution models uncertainty in our parameter estimates. In the previous section, we discussed how maximum likelihood matrix factorizations can lead to predictions inconsistent with the joint distribution of the parameters and data. In contrast, the posterior distribution, Equation 3.17, is consistent with the joint distribution. The criticism of Welling et al. [131] does not apply to Bayesian matrix factorization, which correctly defines a generative distribution over rows and columns.

At  first  glance,  Equation  3.17  bears  little  semblance  to  the  objective  we  optimize  in 
maximum  likelihood  matrix  factorization:  Equation  3.5.  However,  consider  the  problem 
of  finding  the  posterior  mode: 


$$(U^*, V^*) = \operatorname*{argmax}_{(U, V)} \bigl( \log p(X \mid U, V, W) + \log p(U, V \mid W) + c \bigr),$$

$$c = -\log \int p(X \mid U, V, W)\, p(U, V \mid W)\, dU\, dV,$$


where $c$ is a constant that does not vary with the parameters. Recalling the relationship between regular exponential families and Bregman divergences, it is easy to show that maximizing $\log p(X \mid U, V, W)$ is equivalent to minimizing the losses we consider in maximum likelihood matrix factorization. Moreover, if $p(U, V \mid W) = p(U)p(V)$ then the log-prior acts as a regularizer over each factor. For example, if $p(U) = \prod_i p(U_{i\cdot})$, and $U_{i\cdot}$ is a multivariate Gaussian with diagonal covariance, then $\log p(U)$ is equivalent to $\ell_2$-regularization of $U$. In many cases, the optimization in Equation 3.5 corresponds to finding the mode of the posterior distribution in Equation 3.17. All the matrix factorizations in Table 3.1 can be generalized to the Bayesian case, but there is no guarantee that sampling from the posterior is easy.

A common approach for sampling from the posterior is Markov Chain Monte Carlo (MCMC), where one cyclically samples over parameters in such a manner that the distribution of samples is, eventually, the posterior. To efficiently sample over subsets of the parameters, one often has to restrict the permissible combinations of likelihood and prior. For example, Bayesian Probabilistic Matrix Factorization [109] assumes that the likelihood and prior are Gaussian. Discrete Component Analysis [19] also makes restrictive conjugacy assumptions on the combination of likelihood and prior. One of the contributions of our Bayesian extension to Collective Matrix Factorization is an adaptive MCMC algorithm that is computationally efficient without restricting the permissible combinations of likelihood and prior. In the single-matrix scenario, another approach that does MCMC for Bayesian Matrix Factorization without conjugacy is Bayesian Exponential Family PCA [84]. Our approach uses both the gradient and partial Hessian of the objective; Bayesian Exponential Family PCA uses only the gradient.

Related to Bayesian inference in matrix factorization is the matter of hierarchical priors. Just as de Finetti exchangeability motivates hierarchical Bayesian modeling, row-column exchangeability (esp. Aldous’ theorem) provides a theoretical foundation for hierarchical priors on $U$ and $V$. Hierarchical priors have the effect of pooling information across the rows and columns. Pooling information across rows can greatly alleviate sparsity in the training data. If a matrix is row-column exchangeable, then the distribution of an entry depends on global effects, row effects, column effects, and dyadic effects. A hierarchical prior on $U$ allows one to learn how the global effect influences the prediction for a particular entry (likewise for $V$). We further discuss the use of hierarchical priors in Chapter 5.


3.11  Matrix  Factorization  as  a  Statistical  Design  Pattern 

A significant part of software design patterns is the formalization of the descriptive process. Gamma et al. [37] list the components of a design pattern; we discuss how they relate to a statistical design pattern, using low-rank matrix factorization as our example.

> Pattern name and classification: “The pattern’s name conveys the essence of the pattern succinctly”. Low-rank matrix factorization conveys the essence of the problem, and is a term widely used in the statistics and machine learning communities.

>  Intent:  “What  does  the  design  do?  What  is  its  rationale  and  intent?  What  particular 
design problems does it address?”. Low-rank matrix factorization represents an $m \times n$ data matrix $X$ as a function of the product of two lower-rank factors $U_{m \times k}$ and $V_{n \times k}$. That is, $X \approx f(UV^T)$. The purposes of this pattern are (i) predicting unobserved entries of the data matrix $X$ (hold-out prediction); (ii) predicting unobserved entries of a new row or column appended to the data matrix (fold-in prediction); (iii) clustering the rows and/or columns of the matrix. Low-rank matrix factorization addresses the choice of parameterization issue. There are many choices for models of $X$; this is one that has been successfully used in many domains, and respects our theoretical understanding of the problem (cf. Aldous’ theorem).

> Also Known As: “Other well-known names for the pattern, if any”. Variants of low-rank matrix factorization go by many names in the literature: principal components analysis, matrix co-clustering, matrix factorization, two-layer latent variable models, Bayesian matrix factorization. See Table 3.1.

> Motivation: “A scenario that illustrates a design problem... and how the pattern solve[s] the problem”. The examples discussed in Section 3.1 serve as motivating examples.

>  Applicability:  “What  are  the  situations  in  which  the  design  pattern  can  be  applied? 
What are examples of poor design that the pattern can address? How can you recognize these situations?” Low-rank matrix factorization can be applied when the data can be set up as a matrix, where we believe the value in entry $X_{ij}$ depends in some fashion on the identity of the entities indexed at row $i$ and column $j$. In essence, the statistical design pattern is applicable when one has a row-column exchangeable matrix. An example of a poor design that matrix factorization resolves is the use of techniques that work on matrix data but treat it as exchangeable or iid rows of data (i.e., confusing exchangeability and row-column exchangeability).

> Structure: “A graphical representation of the classes in the pattern...”. The closest analogue in graphical models would be a visual description, such as the plate model in Figure 3.1.

>  Participants:  “The  classes  and/or  objects  participating  in  the  design  pattern  and  their 
responsibilities.” The closest analogue in a statistical design pattern is the training objective and the optimization technique, which abstracts the statistical model into a form that can be readily implemented.

>  Collaborations:  “What  are  the  trade-offs  and  results  of  using  the  pattern?  What  aspects 
of the system structure does it let you vary independently?”. There are trade-offs in the use of matrix factorization: latent-variable models are difficult to interpret, since there is no intensional definition of a topic/cluster/factor; the parameter space grows with the number of rows and columns, defeating any theoretical arguments based on asymptotic consistency; training involves large-parameter optimizations over non-linear objectives, with many local optima. With respect to varying system structure, the six modeling choices we have described in this chapter can be, and in the literature are, independently varied.

Figure 3.1: Plate model representation of maximum likelihood single matrix factorization. The intersection of the plates is interpreted as “for each combination of row $i$ and column $j$”. Shaded nodes refer to known or observed variables. Unshaded nodes refer to unknown (unobserved) variables.

>  Implementation:  “What  pitfalls,  hints,  or  techniques  should  you  be  aware  of  when 
implementing this pattern?” We briefly discussed some of the algorithmic issues in training, which are more extensively explored in the context of collective matrix factorization, which subsumes low-rank matrix factorization (Chapter 4).

>  Sample  Code:  “Code  fragments  that  illustrate  how  you  might  implement  this  pattern”. 
The publications underlying this thesis include a MATLAB toolkit for low-rank matrix factorization (as part of collective matrix factorization).4

> Known Uses: A few of the examples of low-rank matrix factorization include document clustering (e.g., pLSI), collaborative filtering [41], and predicting links in social networks [51].

>  Related  Patterns:  “What  design  patterns  are  closely  related  to  this  one?  What  are  the 
important differences?” Collective Matrix Factorization is a generalization of matrix factorization to sets of related matrices, which we propose as an approach to information integration. Semi-supervised matrix factorization deals with the scenario where either the rows or columns are labelled, and the goal is to label new rows or columns.

4 http://www.cs.cmu.edu/~ajit/cmf

Statistical design patterns formalize expertise in the development and application of graphical models in much the same way that software design patterns formalize the development and application of large software systems.

Chapter 4

Collective  Matrix  Factorization 

4.1  Introduction 


In the last chapter, we considered the simplest example of a relational data set: an arity-two relation where the rows and columns index entities. A limitation of matrix factorization is that we only consider one relation; but, going back to the introduction, our interest is in data sets where entities can participate in many different relations (properties). In this chapter, we generalize matrix factorization to allow an entity to participate in more than one arity-two relation.

We build upon low-rank matrix factorization, but as we have seen there are many potential models we could build on. Since our goal is predicting unobserved values of a relation, we shall not consider clustering and matrix co-clustering methods in further detail. Factor analysis models lack stochastic (clustering) constraints on the factors, which significantly complicate training. As importantly, relations can take on many different types of values; we desire flexibility in modeling the response type. If we consider the models in Table 3.1, these desiderata leave us with Exponential Family PCA and its regularized extension, G2L2M.

We begin by discussing the basic idea of parameter tying in low-rank matrix factorization, which is the foundation of collective matrix factorization. Then, we formally develop collective matrix factorization as a probabilistic model (Section 4.3). We discuss how alternating projections, first introduced in Section 3.5, are particularly useful in training our models. We also detail the Newton projection, which is what makes training relatively efficient (Section 4.3.1). For clarity, we focus on a simple example of collective matrix factorization, involving two related matrices. In Section 4.3.3 we show how the approach is readily generalized to more than two related matrices. To further illustrate the power of the alternating projections approach, we show how a technique for stochastic approximation in regression models can be generalized to low-rank matrix factorization (Section 4.4). Returning to our overarching goal, information integration, we consider the problem of augmented collaborative filtering. In our experiments (Section 4.5) we augment a relation that represents users’ ratings of movies with side information about the movies, namely which genres describe the movie. We show that collective matrix factorization can use genre information to improve the quality of ratings prediction, and vice versa.

Main Contributions: While we do not claim ownership of the idea of tying low-rank parameters in matrix factorization, we are the first to apply low-rank parameter tying to Exponential Family PCA. Since E-PCA is so flexible in the choice of data distributions, we can deal with relations of differing response type. The most significant contributions are in the learning algorithm, where we extend alternating Newton-projection on a single matrix to sets of related matrices. The resulting algorithm is memory-efficient, easy to parallelize, and can work even with a large number of entities (the per-iteration cost is linear in the number of entities). Moreover, we consider the case where the matrices are both large and densely observed. Here, we develop a stochastic Newton projection that allows one to trade between the computational cost of training and predictive accuracy of the resulting model. Empirically, we have found that the stochastic Newton projection leads to significantly faster convergence during the first few iterations, making it particularly suitable for training models where CPU resources are limited. Our stochastic Newton approach can be applied anywhere alternating Newton-projections can be used, but is most appropriate for large, densely observed matrices.


4.2  Relational  Schemas 

A relational schema defines the structure among a set of relations. Abstractly, we view a relational schema as a collection of $t$ entity-types $\mathcal{E}_1, \ldots, \mathcal{E}_t$ along with a list of relations, each represented by a matrix of data. Each matrix involves two entity-types, one for the rows and another for the columns. There are $n_i$ entities of type $i$, denoted $\{x^{(i)}_e\}_{e=1}^{n_i}$. A relation between two types is $\mathcal{E}_i \sim_u \mathcal{E}_j$; the index $u \in \mathbb{N}$ allows us to distinguish multiple relations between the same types, and is omitted when there is no ambiguity. The matrix containing the values for $\mathcal{E}_i \sim_u \mathcal{E}_j$ has $n_i$ rows, $n_j$ columns, and is denoted $X^{(ij,u)}$. If we have not observed all possible values of a relation, we fill in unobserved entries with 0 (so that $X^{(ij,u)}$ is a sparse matrix), and assign them zero weight when learning parameters. By convention, for each relation $\mathcal{E}_i \sim \mathcal{E}_j$, we assume $i \le j$.

Since all the relations are arity-two, we represent the schema as an undirected graph where the nodes are entity types, and the edges are abstract relations between them: $E = \{(i, j) : \mathcal{E}_i \sim \mathcal{E}_j \wedge i \le j\}$.1 Without loss of generality, we assume that the schema-as-graph is fully connected. If not, we can fit each connected component in the schema independently. Our definition of a relational schema is similar to an entity-relationship model [22].

We fit each relation matrix as the product of latent factors, $X^{(ij)} \approx f^{(ij)}\bigl(U^{(i)} (U^{(j)})^T\bigr)$, where $U^{(i)} \in \mathbb{R}^{n_i \times k_{ij}}$ and $U^{(j)} \in \mathbb{R}^{n_j \times k_{ij}}$ for $k_{ij} \in \{1, 2, \ldots\}$. If $\mathcal{E}_j$ participates in more than one relation, we allow our model to reserve columns of $U^{(j)}$ for modelling a specific relation. This flexibility allows us, for example, to have relations with different latent dimensions, or to have more than one relation between $\mathcal{E}_i$ and $\mathcal{E}_j$ without forcing ourselves to predict the same value for each relation. In an implementation, we would store a list of participating column indices from each factor for each relation; but to avoid clutter, we ignore this possibility in our notation. Unless otherwise noted, the prediction link $f^{(ij)}$ is an element-wise function on matrices.


4.3  Collective  Factorization 

To avoid an excess of notation, we introduce Collective Matrix Factorization on a three entity-type problem, corresponding to the schema $\mathcal{E}_1 \sim \mathcal{E}_2 \sim \mathcal{E}_3$. We generalize the algorithm to three or more related matrices in Section 4.3.3. The factorizations corresponding to the three entity-type schema are

$$X^{(12)} \approx f^{(12)}\bigl(U^{(1)} (U^{(2)})^T\bigr), \tag{4.1}$$

$$X^{(23)} \approx f^{(23)}\bigl(U^{(2)} (U^{(3)})^T\bigr). \tag{4.2}$$

Let $X = X^{(12)}$, $Y = X^{(23)}$, $f_1 = f^{(12)}$, $f_2 = f^{(23)}$, $m = n_1$, $n = n_2$, $r = n_3$, $U = U^{(1)}$, $V = U^{(2)}$, and $Z = U^{(3)}$. We assume that the latent dimension is $k = k_{12} = k_{23}$. We can rewrite Equations 4.1-4.2 as

$$X \approx f_1(UV^T), \tag{4.3}$$

$$Y \approx f_2(VZ^T). \tag{4.4}$$

1 An abstract relation is one where the arguments are variables: e.g., Rating(user, movie). Observed relations are sets of relations where the arguments are grounded: e.g., {Rating(user_1, movie_?) = 1, Rating(user_3, movie_12)}.

In our experiments, augmented collaborative filtering is an example of a problem that fits into the three entity-type schema: $\mathcal{E}_1$ are users, $\mathcal{E}_2$ are movies, and $\mathcal{E}_3$ are genres. $X$ is a matrix of observed ratings, and $Y$ indicates which genres a movie belongs to (each column corresponds to a genre, and movies can belong to multiple genres).
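As a rough illustration of Equations 4.3-4.4, the NumPy sketch below builds the three factors for a toy users/movies/genres schema and forms both predictions from the shared movie factor $V$. The sizes, the random initialization, and the choice of an identity link for ratings and a logistic link for genre membership are assumptions made for the example; they are not the settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: m users, n movies, r genres, k latent dimensions.
m, n, r, k = 5, 4, 3, 2

U = rng.normal(size=(m, k))  # user factor
V = rng.normal(size=(n, k))  # movie factor, shared by both relations
Z = rng.normal(size=(r, k))  # genre factor


def identity_link(theta):
    """Assumed prediction link f1 for a Gaussian response (ratings)."""
    return theta


def logistic_link(theta):
    """Assumed prediction link f2 for a Bernoulli response (genre membership)."""
    return 1.0 / (1.0 + np.exp(-theta))


X_hat = identity_link(U @ V.T)  # predicted ratings, shape (m, n)
Y_hat = logistic_link(V @ Z.T)  # predicted genre probabilities, shape (n, r)
```

Because the same rows of `V` appear in both products, any change to a movie's latent representation moves its predicted ratings and its predicted genres together, which is the parameter tying the chapter builds on.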

From Equations 4.3-4.4 we choose a probabilistic model with parameters $\mathcal{F} = \{U, V, Z\}$, and data $\mathcal{D} = \{X, Y\}$. The likelihood of each matrix is

$$p(X \mid U, V, W) = \prod_{i=1}^{m} \prod_{j=1}^{n} \bigl[ p_X(X_{ij} \mid U_{i\cdot} V_{j\cdot}^T) \bigr]^{W_{ij}}, \tag{4.5}$$

$$p(Y \mid V, Z, \tilde{W}) = \prod_{j=1}^{n} \prod_{r=1}^{r} \bigl[ p_Y(Y_{jr} \mid V_{j\cdot} Z_{r\cdot}^T) \bigr]^{\tilde{W}_{jr}}. \tag{4.6}$$

The per-entry distributions $p_X$ and $p_Y$ are one-parameter exponential families with natural parameter $U_{i\cdot} V_{j\cdot}^T$ and $V_{j\cdot} Z_{r\cdot}^T$, respectively. The modeler is free to choose $p_X$ and $p_Y$, and they need not be from the same exponential family; this allows us to integrate relations with different response types: e.g., Co-occurs may be well-modelled by the Binomial distribution, but Response (brain voxel activity) is best modelled by a Gaussian. The weights $W_{ij} \ge 0$ and $\tilde{W}_{jr} \ge 0$ allow us to handle missing data: we set a weight to zero when the corresponding value in the data matrix is unobserved. We can learn the factors in Equations 4.5 and 4.6 using maximum likelihood or Bayesian methods. If we maximize Equation 4.5 when $X$ is fully observed under maximum likelihood, we get back Exponential Family PCA (see Table 3.1 under E-PCA).
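The weighted likelihoods in Equations 4.5 and 4.6 can be sketched in a few lines once a per-entry exponential family is chosen. The example below is a minimal illustration, assuming a unit-variance Gaussian for one relation and a Bernoulli for the other; the function names are hypothetical and not part of any toolkit.

```python
import numpy as np


def gaussian_loglik(x, theta):
    """Per-entry log-density of a unit-variance Gaussian with mean theta."""
    return -0.5 * (x - theta) ** 2 - 0.5 * np.log(2.0 * np.pi)


def bernoulli_loglik(y, theta):
    """Per-entry log-probability of y in {0, 1} with log-odds theta."""
    return y * theta - np.logaddexp(0.0, theta)


def weighted_loglik(data, row_factor, col_factor, weights, entry_loglik):
    """Logarithm of Equation 4.5 (or 4.6): a weighted sum of per-entry terms."""
    theta = row_factor @ col_factor.T  # natural parameter for every entry
    return float(np.sum(weights * entry_loglik(data, theta)))
```

Taking logs turns the exponent $W_{ij}$ into a multiplicative weight, so an entry with zero weight contributes nothing regardless of the placeholder value stored for it.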

Maximizing the product of Equations 4.5 and 4.6 with respect to the factors $\mathcal{F}$ is an example of collective matrix factorization [116]. We place a multivariate Gaussian prior on each row of $U$:

$$p(U \mid \Theta_U) = \prod_{i=1}^{m} \mathcal{N}(U_{i\cdot} \mid \mu_U, \Sigma_U), \tag{4.7}$$

where $\mathcal{N}(\cdot \mid \mu_U, \Sigma_U)$ is a Gaussian with mean vector $\mu_U$ and covariance matrix $\Sigma_U$. We assume that $\mu_U = 0$ and $\Sigma_U$ is spherical, so that the covariance can be simplified to regularization coefficients $\lambda_U$, $\lambda_V$, and $\lambda_Z$. The priors over $V$ and $Z$ are defined similarly.

In log-space, a zero-mean spherical covariance prior can also be encoded using Bregman divergences:

$$D_G(0 \,\|\, U) = G^*(U), \qquad D_H(0 \,\|\, V) = H^*(V), \qquad D_I(0 \,\|\, Z) = I^*(Z),$$

where $G(u) = \lambda_U u^2/2$, $H(v) = \lambda_V v^2/2$, and $I(z) = \lambda_Z z^2/2$. Appendix A provides a complete derivation of the gradient and Hessian, along with a more detailed discussion of the Bregman form for the regularizers. A plate model for collective matrix factorization is provided in Figure 4.1.

Figure 4.1: Plate representation of collective matrix factorization. Shaded nodes indicate observed data or quantities which must be selected, e.g. $\mu_U$ and $\Sigma_U$. Weights $W$ and $\tilde{W}$ are elided.

Taken together, Equations 4.5-4.7 define the posterior distribution of the parameters given the data:

$$p(U, V, Z \mid X, Y, \Theta) = c \cdot p(X \mid U, V, W)\, p(Y \mid V, Z, \tilde{W})\, p(U \mid \Theta_U)\, p(V \mid \Theta_V)\, p(Z \mid \Theta_Z),$$

where $c$ is a normalizing constant. Maximum a posteriori inference (a.k.a. regularized maximum likelihood) involves searching for the parameters which minimize the negative log-posterior

$$\mathcal{L} = -\log p(U, V, Z \mid X, Y, W, \tilde{W}, \Theta). \tag{4.8}$$

Since we assume that $X_{ij}$ and $Y_{jr}$ are draws from an exponential family distribution, maximizing the likelihood $p_X(X_{ij} \mid U_{i\cdot} V_{j\cdot}^T)$ is equivalent to minimizing the Bregman divergence $D_{F_1}(U_{i\cdot} V_{j\cdot}^T \,\|\, X_{ij})$. Therefore, we can rewrite $\mathcal{L}$ using Bregman divergences:

$$\mathcal{L} = \underbrace{D_{F_1}(UV^T \,\|\, X, W)}_{\text{from } p(X \mid U, V, W)} + \underbrace{D_{F_2}(VZ^T \,\|\, Y, \tilde{W})}_{\text{from } p(Y \mid V, Z, \tilde{W})} + \underbrace{\mathcal{R}(U)}_{\text{from } p(U \mid \Theta_U)} + \underbrace{\mathcal{R}(V)}_{\text{from } p(V \mid \Theta_V)} + \underbrace{\mathcal{R}(Z)}_{\text{from } p(Z \mid \Theta_Z)}, \tag{4.9}$$

where $\mathcal{R}(\cdot)$ is the $\ell_2$-regularizer for the factor. A coarse adjustment for the relative importance of the two data matrices involves scaling $D_{F_1}$ by $\alpha \in [0, 1]$ and $D_{F_2}$ by $(1 - \alpha)$. We consider the effect of varying $\alpha$ in Section 4.5.

4.3.1 Parameter Estimation


Equation 4.8 is non-convex in its parameters $(U, V, Z)$. If we considered $\mathcal{L}$ without relating it to Bregman divergence, it would appear that our only recourse is to apply a generic nonlinear optimizer. However, Equation 4.9 reveals structure in the problem that we can exploit. First, $\mathcal{L}$ is the sum of five terms, each of which is convex in its arguments. The problem is that the arguments to the Bregman divergences are the products of factors, $UV^T$ and $VZ^T$, instead of $(U, V)$ and $(V, Z)$. However, it can be shown that $D_{F_1}(UV^T \,\|\, X, W)$ is convex in $U$ when $V$ is fixed, and likewise convex in $V$ when $U$ is fixed [45]. Similarly, $D_{F_2}$ is convex in $V$ when $Z$ is fixed, and convex in $Z$ when $V$ is fixed. So while $\mathcal{L}$ is not convex, it is componentwise convex: i.e., convex in one low-rank factor when the others are fixed.

Componentwise convexity of $\mathcal{L}$ leads naturally to alternating projections: cyclically optimize $\mathcal{L}$ with respect to $U$, then $V$, then $Z$. Since $\mathcal{L}$ is convex in a single low-rank factor, each projection is a convex optimization. Alternating projections is block coordinate descent.
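The block coordinate descent loop can be sketched for the simplest special case, assuming squared-error losses for both matrices (identity links) so that each projection has a closed-form, row-wise ridge-regression solution. The function names and the shared regularization coefficient `lam` are hypothetical simplifications; a general exponential-family loss would replace the exact solve with the Newton projection developed in the rest of this section.

```python
import numpy as np


def loss(U, V, Z, X, Y, W, Wt, alpha, lam):
    """Squared-error instance of the objective, with l2 regularization."""
    lx = 0.5 * np.sum(W * (U @ V.T - X) ** 2)
    ly = 0.5 * np.sum(Wt * (V @ Z.T - Y) ** 2)
    reg = (np.sum(U**2) + np.sum(V**2) + np.sum(Z**2)) / (2.0 * lam)
    return alpha * lx + (1.0 - alpha) * ly + reg


def project_rows(targets, designs, weights, scales, lam):
    """Exactly minimize the objective over one factor, one row at a time."""
    n_rows, k = targets[0].shape[0], designs[0].shape[1]
    out = np.empty((n_rows, k))
    for p in range(n_rows):
        H = np.eye(k) / lam  # contribution of the regularizer
        g = np.zeros(k)
        for T, A, Wb, s in zip(targets, designs, weights, scales):
            H += s * (A.T * Wb[p]) @ A      # weighted normal equations
            g += s * A.T @ (Wb[p] * T[p])
        out[p] = np.linalg.solve(H, g)
    return out


def alternating_projections(U, V, Z, X, Y, W, Wt, alpha, lam, sweeps=5):
    """Cycle over U, then V, then Z; V sees both data matrices."""
    history = [loss(U, V, Z, X, Y, W, Wt, alpha, lam)]
    for _ in range(sweeps):
        U = project_rows([X], [V], [W], [alpha], lam)
        V = project_rows([X.T, Y], [U, Z], [W.T, Wt], [alpha, 1.0 - alpha], lam)
        Z = project_rows([Y.T], [V], [Wt.T], [1.0 - alpha], lam)
        history.append(loss(U, V, Z, X, Y, W, Wt, alpha, lam))
    return U, V, Z, history
```

Since every projection exactly minimizes a convex subproblem, the recorded objective values in `history` never increase from one sweep to the next.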

The question that remains is how to implement the projection. For a convex optimization, the ideal solution would be to use Newton-Raphson, which incorporates both the gradient and the Hessian. Below, we outline the derivation of the Newton projection. A more detailed derivation, starting from the basics of matrix calculus, is available in Appendix A.

First, differentiate $\mathcal{L}$ with respect to each factor:2

$$\nabla_U \mathcal{L} = \alpha \bigl(W \odot (f_1(UV^T) - X)\bigr) V + \nabla G^*(U), \tag{4.10}$$

$$\nabla_V \mathcal{L} = \alpha \bigl(W \odot (f_1(UV^T) - X)\bigr)^T U + (1 - \alpha) \bigl(\tilde{W} \odot (f_2(VZ^T) - Y)\bigr) Z + \nabla H^*(V), \tag{4.11}$$

$$\nabla_Z \mathcal{L} = (1 - \alpha) \bigl(\tilde{W} \odot (f_2(VZ^T) - Y)\bigr)^T V + \nabla I^*(Z). \tag{4.12}$$

2 We remind the reader that $A \odot B$ refers to the Hadamard (element-wise) product.

For $\ell_2$ regularization on a factor $U$, $G(U) = \lambda \|U\|_F^2 / 2$, therefore $\nabla G^*(U) = U/\lambda$. Setting one of these gradients equal to zero, with the other factors held fixed, yields the exact minimum over that factor. We can use gradient descent to find a root of the above equations.

To improve performance over gradient descent, a natural idea is to try a Newton method for solving Equations 4.10-4.12. A cursory inspection of Equations 4.10-4.12 suggests that a Newton step is infeasible. For example, the Hessian with respect to $U$ would involve $nk$ parameters, and thus inversion of an $nk \times nk$ matrix, which would require $O(n^3k^3)$ time and $O(n^2k^2)$ memory. In our augmented collaborative filtering experiments, $n$ is the number of users; spending $O(n^3k^3)$ time is out of the question.
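The gradients in Equations 4.10-4.12 translate almost directly into matrix code. The sketch below assumes one concrete instance: a Gaussian model for $X$ (so $f_1$ is the identity), a Bernoulli model for $Y$ (so $f_2$ is the logistic function), and a single regularization coefficient `lam` shared by all three factors. These choices and the function names are illustrative assumptions.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def loss(U, V, Z, X, Y, W, Wt, alpha, lam):
    """Objective for Gaussian X and Bernoulli Y, with l2 regularization."""
    tx, ty = U @ V.T, V @ Z.T
    lx = np.sum(W * 0.5 * (tx - X) ** 2)
    ly = np.sum(Wt * (np.logaddexp(0.0, ty) - Y * ty))
    reg = (np.sum(U**2) + np.sum(V**2) + np.sum(Z**2)) / (2.0 * lam)
    return alpha * lx + (1.0 - alpha) * ly + reg


def gradients(U, V, Z, X, Y, W, Wt, alpha, lam):
    """Gradients with respect to U, V and Z (Equations 4.10-4.12)."""
    Rx = W * (U @ V.T - X)             # W  (.) (f1(U V^T) - X), shape (m, n)
    Ry = Wt * (sigmoid(V @ Z.T) - Y)   # W~ (.) (f2(V Z^T) - Y), shape (n, r)
    gU = alpha * Rx @ V + U / lam
    gV = alpha * Rx.T @ U + (1.0 - alpha) * Ry @ Z + V / lam
    gZ = (1.0 - alpha) * Ry.T @ V + Z / lam
    return gU, gV, gZ
```

The gradient for `V` is the only one that mixes residuals from both matrices, which is how information flows between the two relations during training. Comparing each gradient against a finite-difference estimate of `loss` is a cheap way to catch sign or transpose mistakes.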

The cursory argument assumes that the Hessian is an arbitrary matrix. However, we can show that when only one low-rank factor can vary, most of the second derivatives of $\mathcal{L}$ with respect to that factor are zero, and the Hessian has special structure that makes inverting it easy. For the subclass of models where Equations 4.10-4.12 are differentiable and the loss is decomposable, define

$$q(U_{i\cdot}) = \alpha \bigl(W_{i\cdot} \odot (f_1(U_{i\cdot} V^T) - X_{i\cdot})\bigr) V + \nabla G^*(U_{i\cdot}),$$

$$q(V_{i\cdot}) = \alpha \bigl(W_{\cdot i} \odot (f_1(U V_{i\cdot}^T) - X_{\cdot i})\bigr)^T U + (1 - \alpha) \bigl(\tilde{W}_{i\cdot} \odot (f_2(V_{i\cdot} Z^T) - Y_{i\cdot})\bigr) Z + \nabla H^*(V_{i\cdot}),$$

$$q(Z_{i\cdot}) = (1 - \alpha) \bigl(\tilde{W}_{\cdot i} \odot (f_2(V Z_{i\cdot}^T) - Y_{\cdot i})\bigr)^T V + \nabla I^*(Z_{i\cdot}).$$

Since all but one factor is fixed, consider the derivatives of $q(U_{i\cdot})$ with respect to any scalar parameter in $U$: $\nabla_{U_{js}} q(U_{i\cdot})$. Because $U_{js}$ only appears in $q(U_{i\cdot})$ when $j = i$, the derivative equals zero when $j \ne i$. Therefore the Hessian $\nabla^2_U \mathcal{L}$ is block-diagonal, where each non-zero block corresponds to a row of $U$. The inverse of a block-diagonal matrix is the block-diagonal matrix of the inverses of its blocks, and so the Newton direction for $U$, $[\nabla_U \mathcal{L}][\nabla^2_U \mathcal{L}]^{-1}$, can be reduced to updating each row $U_{i\cdot}$ using the direction $[q(U_{i\cdot})][q'(U_{i\cdot})]^{-1}$.

The above derivation shows that we can reduce the projection over the factor $U$ into a set of parallel row-wise projections over $\{U_{i\cdot}\}$. By exploiting structure within the Hessian, we have reduced the cost of using the Hessian from $O(n^3k^3)$ to $O(nk^3)$. We can further reduce the cost of Newton projection by replacing the exact computation of the step with an approximation, e.g., using a fixed number of iterations of conjugate gradient to find the step direction. Using an approximate step reduces the cost of the Newton projection to $O(nk^2)$ time and $O(nk^2)$ memory.3
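The row-wise structure can be made concrete with a short sketch of one Newton projection over $U$ with $V$ held fixed. It assumes $\ell_2$ regularization with coefficient `lam`, takes the link `f` and its derivative `fprime` as arguments (defaulting to the identity link of a Gaussian model), and solves each $k \times k$ system exactly rather than with conjugate gradient. The function name and signature are hypothetical.

```python
import numpy as np


def newton_project_U(U, V, X, W, alpha, lam,
                     f=lambda t: t, fprime=np.ones_like, eta=1.0):
    """One Newton step per row of U, using only the k x k diagonal blocks."""
    k = U.shape[1]
    U_new = np.empty_like(U)
    for i in range(U.shape[0]):
        theta = U[i] @ V.T                        # natural parameters of row i
        q = alpha * (W[i] * (f(theta) - X[i])) @ V + U[i] / lam
        d = W[i] * fprime(theta)                  # diagonal of the weight matrix
        H = alpha * (V.T * d) @ V + np.eye(k) / lam   # one Hessian block
        U_new[i] = U[i] - eta * np.linalg.solve(H, q)
    return U_new
```

Each iteration of the loop touches a single row, so the rows can be processed in parallel, or sequentially with only one $k \times k$ block in memory at a time. With the identity link the subproblem is quadratic, so a single full step (`eta=1.0`) lands on the exact row-wise minimizer.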

The block diagonal Hessian may give one the impression that there are no interactions between different rows of a factor, which is not true. Different rows of a factor can influence each other, but only indirectly through the other factor in the bilinear form. Two rows of $U$ interact, indirectly, through their effect on $V$. However, since we fixed $V$ in the projection over $U$, the inter-row effects are masked.

By fixing all but one factor and varying each factor in turn, we get a block-diagonal approximation to the true Hessian across all factors. Approximating the Hessian is also at the heart of quasi-Newton methods, like BFGS [90], which use a low-rank approximation of the Hessian. We get a perfect model of the Hessian within each row by ignoring weaker, indirect, inter-row effects. Low-rank approximations of the Hessian approximate the inter-row effects, as well as the intra-row effects. Combining the two approaches to improve our approximation of the per-factor Hessians $\nabla^2_U \mathcal{L}$, $\nabla^2_V \mathcal{L}$, and $\nabla^2_Z \mathcal{L}$, as well as the Hessian with respect to all the factors, $\nabla^2 \mathcal{L}$, remains a topic of future work.

3 The memory requirements can be reduced to $O(k^2)$ if the per-row projections are done sequentially.

It remains to be shown how we can compute the non-zero blocks in the Hessian. Any (local) optimum of the loss $\mathcal{L}$ corresponds to roots of the equations $\{q(U_{i\cdot}) = 0\}_{i=1}^{m}$, $\{q(V_{i\cdot}) = 0\}_{i=1}^{n}$, and $\{q(Z_{i\cdot}) = 0\}_{i=1}^{r}$.

To find the roots of these equations, we use Newton’s method. For example, the Newton update for $U_{i\cdot}$ is

$$U_{i\cdot}^{\text{new}} = U_{i\cdot} - q(U_{i\cdot}) \bigl[q'(U_{i\cdot})\bigr]^{-1}. \tag{4.13}$$

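To concisely describe the Hessian we introduce terms for the contribution of the regularizer,

$$G_i = \operatorname{diag}\bigl(\nabla^2 G^*(U_{i\cdot})\bigr), \quad H_i = \operatorname{diag}\bigl(\nabla^2 H^*(V_{i\cdot})\bigr), \quad I_i = \operatorname{diag}\bigl(\nabla^2 I^*(Z_{i\cdot})\bigr),$$

and terms for the contribution of the reconstruction error,

$$D_{1,i} = \operatorname{diag}\bigl(W_{i\cdot} \odot f_1'(U_{i\cdot} V^T)\bigr), \quad D_{2,i} = \operatorname{diag}\bigl(W_{\cdot i} \odot f_1'(U V_{i\cdot}^T)\bigr),$$

$$D_{3,i} = \operatorname{diag}\bigl(\tilde{W}_{i\cdot} \odot f_2'(V_{i\cdot} Z^T)\bigr), \quad D_{4,i} = \operatorname{diag}\bigl(\tilde{W}_{\cdot i} \odot f_2'(V Z_{i\cdot}^T)\bigr).$$

The Hessians with respect to the loss $\mathcal{L}$ are

$$q'(U_{i\cdot}) = \nabla q(U_{i\cdot}) = \alpha V^T D_{1,i} V + G_i,$$

$$q'(Z_{i\cdot}) = \nabla q(Z_{i\cdot}) = (1 - \alpha) V^T D_{4,i} V + I_i,$$

$$q'(V_{i\cdot}) = \nabla q(V_{i\cdot}) = \alpha U^T D_{2,i} U + (1 - \alpha) Z^T D_{3,i} Z + H_i.$$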


Each update of $U$, $V$, and $Z$ reduces at least one term in Equation 4.9. Therefore, iteratively cycling through the updates leads to a local optimum. The Newton step may be scaled by a step length $\eta \in (0, 1]$, which we select using the Armijo criterion [90]. In practice, we simplify the projection by taking one Newton step, instead of running to convergence. Even taking one step, we are guaranteed to reduce the value of at least one term in Equation 4.9, guaranteeing convergence to a local optimum.

Adjusted Dependent Variate Projection

Since  the  per-row  updates  are  identical  to  an  iteration  of  iteratively-reweighted  least  squares 
for  a  GLM,  we  mirror  the  literature  by  rewriting  the  update  in  adjusted  dependent  variate 
form  [78].  The  adjusted  dependent  variate  form  represents  each  update  to  a  factor  row 
as  the  solution  to  a  weighted  regression  problem.  Let  Av  =  Ui./Xjj,  Bi.  =  V)./Ay,  and 
CV  =  Zi./Xz  account  for  the  regularizer  terms  in  the  gradient.  Let  77  be  the  step  length. 
Rearranging  terms  in  Equation  4.13, 

UTtfiUi.)  =  a  ( Ut.VT  +  77  (Wi.  ©  {Xr.  -  f(Ul.VT))  D-})  D^V  +  Ut.G,,  - rjA.. 

(4.14) 

Likewise  for  Z*. 

ZT<i{Zi.)  =  (1  -  a)  (zi.VT  +  7 7  (W.i  ©  (Y.i  -  f(VZ^))T  D^V  +  Z,.Ii  ~  vCi- 

(4.15) 

The  update  for  Vt.,  which  ties  the  data  matrices  together,  has  a  similar  derivation  since 
L(U}  V,  Z)  is  the  sum  of  per-matrix  losses  and  the  differential  operator  is  linear: 

Vr<aVi)  =  ot  {  ( Vt.UT  +  77  (W,  ©  (X.i  -  f{UV?)))T  D -j)  D2iiu]  +  (4.16) 

(1  -  a)  {  {vaT  +  V  (Wi.  ©  (V,  -  f(Vt.ZT )))  D-^  D3jiZ }  + 

Vi.  Hi  -  rjB,. 
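
The rearrangement behind Equation 4.14 can be spelled out step by step. This is a sketch
under two assumptions that are not stated in this passage: that the full-data gradient is
$q(U_{i\cdot}) = \alpha (W_{i\cdot} \odot (f(U_{i\cdot}V^T) - X_{i\cdot})) V + A_{i\cdot}$ (the full-sample analogue of
the sample gradient in Section 4.4), and that $D_{1,i}$ is invertible. Multiplying the Newton
update, scaled by the step length $\eta$, on the right by $q'(U_{i\cdot})$ gives:

```latex
\begin{align*}
U_{i\cdot}^{\text{new}}\, q'(U_{i\cdot})
  &= U_{i\cdot}\, q'(U_{i\cdot}) - \eta\, q(U_{i\cdot}) \\
  &= U_{i\cdot}\left(\alpha V^T D_{1,i} V + G_i\right)
     - \eta\left[\alpha\left(W_{i\cdot} \odot (f(U_{i\cdot}V^T) - X_{i\cdot})\right) V + A_{i\cdot}\right] \\
  &= \alpha\left(U_{i\cdot}V^T
     + \eta\left(W_{i\cdot} \odot (X_{i\cdot} - f(U_{i\cdot}V^T))\right) D_{1,i}^{-1}\right) D_{1,i} V
     + U_{i\cdot} G_i - \eta A_{i\cdot}.
\end{align*}
```

The last line has the form of a weighted least-squares normal equation with weights $D_{1,i}$,
which is why each row update can be read as a weighted regression.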

4.3.2  Imbalanced  Matrices 

The objective $\mathcal{L}$ is the sum of two per-matrix reconstruction errors. If one matrix is larger
than the other, it will dominate the objective. We use data weights, or equivalently $\alpha$, to
turn $\mathcal{L}$ into a per-element objective by scaling each element of $X$ by $(nm)^{-1}$ and each
element of $Y$ by $(nr)^{-1}$. This heuristic ensures that larger matrices do not dominate the
model simply because they are larger. Data weights can also be used to correct for differences
in the scale of $D_{F_1}$ and $D_{F_2}$, which can occur when one of the Bregman divergences
is not regular. Since we restrict our consideration to Bregman divergences that correspond
to probability distributions, we do not need to correct for differences in scale: probabilities
are all on the same $[0, 1]$ scale. If the Bregman divergences are not regular, computing

$$D_{F_1}(UV^T \,\|\, X, W) \,/\, D_{F_2}(VZ^T \,\|\, Y, W),$$

averaged over uniform random parameters $U$, $V$, and $Z$, provides an adequate estimate of
the relative scale of the two losses.
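
To make the per-element scaling concrete, the following Python sketch computes the weight
given to each entry of $X$ and $Y$ and the mixing coefficient that would weight the two
reconstruction terms in the same ratio. The function name and the matrix sizes are
illustrative assumptions, and the equivalence to $\alpha$ holds only up to an overall rescaling
of the objective, which would also rescale the regularizer.

```python
def per_element_weights(m, n, r):
    """Return (weight per entry of X, weight per entry of Y, matching alpha).

    X is m x n and Y is n x r, so each entry is scaled by one over the
    number of entries in its own matrix.
    """
    wx = 1.0 / (n * m)
    wy = 1.0 / (n * r)
    # Rescale the pair so the two coefficients sum to one, as alpha and 1 - alpha do.
    alpha = wx / (wx + wy)
    return wx, wy, alpha

# Illustrative sizes only: 500 rows in X, 3000 shared entities, 22 columns in Y.
wx, wy, alpha = per_element_weights(m=500, n=3000, r=22)
```

Algebraically the matching coefficient simplifies to $r / (m + r)$, so the smaller matrix
receives the larger per-entry weight.
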

4.3.3 Generalizing to Arbitrary Schemas

We have focused on the three entity-type (two matrix) case, but our approach can be
applied to any relational schema $E$ containing only binary relations. The objective is

$$\mathcal{L}_{\text{general}} = \sum_{(i,j) \in E \,\vee\, (j,i) \in E} \alpha^{(ij)} \left( D_{F^{(ij)}}\left( U^{(i)} (U^{(j)})^T \,\|\, X^{(ij)}, W^{(ij)} \right) \right) + \sum_{i=1}^{t} \left( \sum_{j : (i,j) \in E \,\vee\, (j,i) \in E} \alpha^{(ij)} \right) \mathcal{R}(U^{(i)}),$$

where $t$ is the number of entity types.

The relative weights $\alpha^{(ij)} > 0$ measure the importance of each matrix in the reconstruction.
We scale the regularization of a factor $U^{(i)}$ by its relative importance in the model: the
degree to which we are concerned about overfitting in a factor should be proportional to that
factor's importance in prediction. Since the loss is a linear function of individual losses,
and the differential operator is linear, both gradient and Newton updates can be derived in
a manner analogous to Section 4.3.1, taking care to distinguish when $U^{(i)}$ acts as a column
factor as opposed to a row factor.
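
The scaling of each factor's regularizer can be sketched in a few lines of Python. The schema
below, its weights, and the function name are hypothetical; the sketch assumes each relation is
listed once as an $(i, j)$ pair of distinct entity types.

```python
from collections import defaultdict

def regularization_weights(alphas):
    """Sum the relation weights over every relation that touches each entity type.

    `alphas` maps an (i, j) pair of entity-type indices to that relation's weight.
    """
    totals = defaultdict(float)
    for (i, j), a in alphas.items():
        totals[i] += a
        totals[j] += a
    return dict(totals)

# Hypothetical schema: type 2 takes part in all three relations.
weights = regularization_weights({(1, 2): 0.5, (2, 3): 0.3, (2, 4): 0.2})
```

In this example the factor for type 2 is regularized most heavily, because it participates in
every relation and so carries the most weight in prediction.
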


4.4  Stochastic  Approximation 

In optimizing a collective factorization model, we are in the unusual situation that our
primary concern is not the cost of inverting the Hessian, but rather the cost of computing
the gradient itself: if $k$ is the largest embedding dimension, then the cost of a gradient
update for a row $U_r^{(i)}$ is $O(k \sum_{j : \mathcal{E}_i \sim \mathcal{E}_j} n_j)$, while the cost of a Newton update for the same
row is $O(k^3 + k^2 \sum_{j : \mathcal{E}_i \sim \mathcal{E}_j} n_j)$. Typically $k$ is much smaller than the number of entities,
and so the Newton update costs only a factor of $k$ more. (The above calculations assume
fully observed matrices; for sparsely-observed relations, we can replace $n_j$ by the number
of entities of type $\mathcal{E}_j$ which are related to entity $x_r^{(i)}$, but the conclusion remains the same.)

The expensive part of the gradient calculation for $U_r^{(i)}$ is to compute the predicted
value for each observed relation that entity $x_r^{(i)}$ participates in, so that we can sum all of the
weighted prediction errors. One approach to reducing this cost is to compute errors only on
a subset of observed relations, picked randomly at each iteration. This technique is known
as stochastic approximation [9]. The best-known stochastic approximation algorithm is
stochastic gradient descent; but, since inverting the Hessian is not a significant part of our
computational cost, we will recommend a stochastic Newton's method instead.

Consider the update for $U_{i\cdot}$ in the three factor model. This update can be viewed as a
regression where the data are $X_{i\cdot}$ and the features are the columns of $V$. If we denote a
sample of the data as $s \subseteq \{1, \ldots, n\}$, then the sample gradient at iteration $\tau$ is

$$\hat{q}_\tau(U_{i\cdot}) = \alpha \left( W_{is} \odot (f(U_{i\cdot} V_{s\cdot}^T) - X_{is}) \right) V_{s\cdot} + \nabla G^*(U_{i\cdot}).$$

Similarly, given subsets $p \subseteq \{1, \ldots, m\}$ and $q \subseteq \{1, \ldots, r\}$, the sample gradients for the
other factors are

$$\hat{q}_\tau(V_{i\cdot}) = \alpha \left( W_{pi} \odot (f(U_{p\cdot} V_{i\cdot}^T) - X_{pi}) \right)^T U_{p\cdot} + (1 - \alpha) \left( W_{iq} \odot (f(V_{i\cdot} Z_{q\cdot}^T) - Y_{iq}) \right) Z_{q\cdot} + \nabla H^*(V_{i\cdot}),$$

$$\hat{q}_\tau(Z_{i\cdot}) = (1 - \alpha) \left( W_{si} \odot (f(V_{s\cdot} Z_{i\cdot}^T) - Y_{si}) \right)^T V_{s\cdot} + \nabla I^*(Z_{i\cdot}).$$

Given the sample gradient, the stochastic gradient update for $U$ at iteration $\tau$ is

$$U_{i\cdot}^{(\tau+1)} = U_{i\cdot}^{(\tau)} - \eta_\tau\, \hat{q}_\tau(U_{i\cdot}^{(\tau)}),$$

and similarly for the other factors. Note that we use a fixed, decaying sequence of learning
rates instead of a line search: sample estimates of the gradient are not always descent
directions, and the fixed schedule for step lengths is required to guarantee convergence of
the projection. An added advantage of the fixed schedule over line search, in the context
of a fast approximate update, is that the latter is computationally expensive.
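
The sample gradient and the fixed-schedule update can be sketched in Python as follows. As
before, the logistic link, the $\ell_2$ regularizer with gradient $u / \lambda$, and all function
and parameter names are our assumptions for illustration; the schedule $\eta_\tau = 1/\tau$ is
the one adopted in Section 4.4.1.

```python
import numpy as np

def sigmoid(theta):
    return 1.0 / (1.0 + np.exp(-theta))

def sample_gradient(u, V, x, w, s, alpha, lam):
    """Sample gradient for one row of U, using only the columns of X indexed by s."""
    Vs = V[s]                                    # rows of V matching the sampled columns
    residual = w[s] * (sigmoid(Vs @ u) - x[s])   # W_is * (f(U_i V_s^T) - X_is)
    return alpha * (residual @ Vs) + u / lam     # data term plus regularizer gradient

def sgd_row_step(u, V, x, w, s, tau, alpha=0.5, lam=10.0):
    """One stochastic gradient step with the fixed schedule eta_tau = 1 / tau."""
    return u - (1.0 / tau) * sample_gradient(u, V, x, w, s, alpha, lam)
```

When `s` contains every column index, `sample_gradient` reduces to the full gradient, which is
a convenient sanity check on an implementation.
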

We sample data non-uniformly, without replacement, from the distribution induced by
the data weights. That is, for a row $U_{i\cdot}$ the probability of drawing $X_{ij}$ is $W_{ij} / \sum_{j'} W_{ij'}$.
This sampling distribution provides a compelling relational interpretation: to update the
latent factors of $x_r^{(i)}$, we sample only observed relations involving $x_r^{(i)}$. For example, to
update a user's latent factors, we sample only movies that the user rated. We use a separate
sample for each row of $U$: this way, errors are independent from row to row, and their
effects tend to cancel. In practice, this means that our actual training loss decreases at
almost every iteration.
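
A minimal sketch of this sampling scheme, assuming NumPy's `Generator.choice` as the sampler
and a hypothetical function name, is:

```python
import numpy as np

def sample_columns(w_row, batch_size, rng):
    """Draw column indices for one row, with probability proportional to its data weights."""
    observed = np.flatnonzero(w_row > 0)         # only observed relations can be drawn
    if observed.size == 0:
        return np.array([], dtype=int)
    size = min(batch_size, observed.size)        # cannot draw more than is observed
    p = w_row[observed] / w_row[observed].sum()
    return rng.choice(observed, size=size, replace=False, p=p)
```

Calling `sample_columns` separately for each row, with the same generator, gives the
independent per-row samples described above.
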

With sampling, the cost of the gradient update no longer grows linearly in the number
of entities related to $x_r^{(i)}$, but only in the number of entities sampled. Another advantage of
this approach is that when we sample one entity at a time, $|s| = |p| = |q| = 1$, stochastic
gradient yields an online algorithm, where observed relations are processed as they appear,
and are not stored across iterations.
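
As mentioned above, we can often improve the rate of convergence by moving from
stochastic gradient descent to stochastic Newton-Raphson updates [9, 11]. For the
three-factor model the sample Hessians are

$$\hat{q}'_\tau(U_{i\cdot}) = \alpha V_{s\cdot}^T \hat{D}_{1,i} V_{s\cdot} + G_i,$$

$$\hat{q}'_\tau(V_{i\cdot}) = \alpha U_{p\cdot}^T \hat{D}_{2,i} U_{p\cdot} + (1 - \alpha) Z_{q\cdot}^T \hat{D}_{3,i} Z_{q\cdot} + H_i,$$

$$\hat{q}'_\tau(Z_{i\cdot}) = (1 - \alpha) V_{s\cdot}^T \hat{D}_{4,i} V_{s\cdot} + I_i,$$

where

$$\hat{D}_{1,i} = \mathrm{diag}(W_{is} \odot f'(U_{i\cdot} V_{s\cdot}^T)), \quad \hat{D}_{2,i} = \mathrm{diag}(W_{pi} \odot f'(U_{p\cdot} V_{i\cdot}^T)),$$

$$\hat{D}_{3,i} = \mathrm{diag}(W_{iq} \odot f'(V_{i\cdot} Z_{q\cdot}^T)), \quad \hat{D}_{4,i} = \mathrm{diag}(W_{si} \odot f'(V_{s\cdot} Z_{i\cdot}^T)).$$

To satisfy convergence conditions, which will be discussed in Section 4.4.1, we use an
exponentially weighted moving average of the Hessian:

$$\bar{q}_\tau(\cdot) = \left( 1 - \frac{2}{\tau} \right) \bar{q}_{\tau-1}(\cdot) + \frac{2}{\tau}\, \hat{q}'_\tau(\cdot). \tag{4.17}$$

When the sample at each step is small compared to the embedding dimension, the
Sherman-Morrison-Woodbury lemma (e.g., [9]) can be used for efficiency. The stochastic Newton
update is analogous to Equation 2.8, except that the step length follows a fixed schedule,
the gradient is replaced by its sample estimate $\hat{q}$, and the Hessian is replaced by its
moving-average estimate $\bar{q}$.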
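
A stochastic Newton step for one row of $U$ can be sketched as below. The moving average uses
weight $2/\tau$ on the newest sample Hessian, matching the form that appears in the convergence
proof of Section 4.4.1. The clipping of that weight to at most one (so that the first
iterations do not give the old average a negative weight), the logistic link, the $\ell_2$
regularizer, and all names are our assumptions.

```python
import numpy as np

def sigmoid(theta):
    return 1.0 / (1.0 + np.exp(-theta))

def stochastic_newton_row_step(u, hbar, V, x, w, s, tau, alpha=0.5, lam=10.0):
    """One stochastic Newton step for a row of U; returns (new row, new averaged Hessian)."""
    Vs = V[s]
    p = sigmoid(Vs @ u)
    grad = alpha * ((w[s] * (p - x[s])) @ Vs) + u / lam              # sample gradient
    d1 = w[s] * p * (1.0 - p)                                        # sampled D_{1,i}
    sample_hess = alpha * (Vs.T * d1) @ Vs + np.eye(len(u)) / lam    # sample Hessian
    rho = min(1.0, 2.0 / tau)             # weight on the new sample (clipping is assumed)
    hbar = (1.0 - rho) * hbar + rho * sample_hess                    # moving-average Hessian
    u_new = u - (1.0 / tau) * np.linalg.solve(hbar, grad)            # step length 1 / tau
    return u_new, hbar
```

The averaged Hessian is carried from one iteration to the next, so a caller would keep one
`hbar` per factor row. For small samples one could instead maintain the inverse directly with
a low-rank update, as the text suggests; that refinement is omitted here.
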

4.4.1  Convergence 

The objective function $\mathcal{L}$ (Equation 4.9) is the training error, whose form is determined by
a fixed set of observations: the observed entries of the data matrices, which we denote as

$$\mathcal{X} = \{X_{ij} \mid W_{ij} > 0,\ \forall i, j\},$$

$$\mathcal{Y} = \{Y_{jr} \mid W_{jr} > 0,\ \forall j, r\}.$$

Minimizing the training error, which is determined from a fixed batch of observations, is
known as batch learning. In contrast, online learning assumes that the observations are
drawn from a fixed distribution over observations: $p(\mathcal{X}, \mathcal{Y})$, also known as the sampling
distribution. As discussed in the previous section, the weights determine the sampling
distribution; if the weights are all either zero or one, then we sample the data uniformly.
At each iteration of the training algorithm, a data point is drawn from the sampling
distribution, used to update the parameters, and then discarded. Samples are drawn
independently across iterations, so the same matrix entry may be sampled repeatedly. Stochastic
optimization is akin to online learning, except that we use a batch of samples at each
iteration.⁴

While using less data at each iteration is computationally advantageous, there is also
a compelling statistical argument for stochastic optimization. The training error $\mathcal{L}$ can be
expressed as the (weighted) average of reconstruction errors, plus regularization terms
independent of the data. The expected loss $E[\mathcal{L}]$ can similarly be defined using the sampling
distribution:

$$E[\mathcal{L}] = \sum_{(X \in \mathcal{X},\, Y \in \mathcal{Y})} \left[ D_{F_1}(UV^T \,\|\, X) + D_{F_2}(VZ^T \,\|\, Y) \right] p_{(\mathcal{X}, \mathcal{Y})}(X, Y) + \sum_{F \in \{U, V, Z\}} \mathcal{R}(F).$$

The linearity of expectations allows for the sampling over matrices to be reduced to
sampling in the innermost loop, where $E[\mathcal{L}]$ is minimized with respect to a factor row.
The stochastic optimization in the previous section is analogous to alternating
Newton-projection, except that it minimizes $E[\mathcal{L}]$. Since we are sampling a fixed set of
observations according to their weights, $E[\mathcal{L}] = \mathcal{L}$.

The convergence argument for alternating Newton-projection is straightforward. Each
projection strictly decreases the objective $\mathcal{L}$, which is bounded below. However, if the
projection is implemented as a step of stochastic Newton, there is no guarantee that the
objective will decrease after each projection. We can show that stochastic Newton is a
convergent procedure, if a sufficient number of iterations are run.

Theorem 2 (Convergence of stochastic Newton projection). Let $\mathcal{L}(F_{i\cdot})$ correspond to
$E[\mathcal{L}]$ where all but one row of a low-rank factor $F \in \{U, V, Z\}$ is fixed.⁵ Let

> $F_{i\cdot}^*$ denote the local optimum,

> $F_{i\cdot}^{(0)}$ the initial value of the factor row, and

> $\{F_{i\cdot}^{(\tau)}\}_{\tau=1}^{\infty}$ a sequence of stochastic Newton iterates on $F_{i\cdot}$.

Consider the final phase of convergence, where the parameters are in a region around the
optimum $F_{i\cdot}^*$ where general convexity holds:⁶

$$\forall \epsilon > 0, \quad \inf_{\|F_{i\cdot} - F_{i\cdot}^*\|_2^2 > \epsilon} \left\langle F_{i\cdot} - F_{i\cdot}^*,\ \nabla \mathcal{L}(F_{i\cdot}) \right\rangle > 0. \tag{4.18}$$

If we assume that the data and parameters are uniformly bounded, then the sequence of
stochastic Newton iterates converges almost surely,

$$\lim_{\tau \to \infty} F_{i\cdot}^{(\tau)} = F_{i\cdot}^*,$$

with probability one.

⁴ Online learning typically assumes that the data generating distribution is unknown, but that an oracle
is available to generate samples. In particular, there are an infinite number of possible samples in online
optimization; in our use of stochastic optimization, there are only a finite number of possible samples.

⁵ It does not matter if $F_{j\cdot}$, $j \neq i$, is fixed or not, since row updates are independent in the projection.


Proof. We use results from [11, 10, 12], which provide the following sufficient conditions
for convergence of stochastic Newton optimization. These conditions must hold in a region
around $F_{i\cdot}^*$ where Equation 4.18 holds:

1. Decreasing step lengths: The sequence of step lengths $\eta_\tau$ must satisfy the following
conditions: $\sum_{\tau=1}^{\infty} \eta_\tau = \infty$, $\sum_{\tau=1}^{\infty} \eta_\tau^2 < \infty$.

2. Bounded stochastic gradient: For a single data sample $(X_{ij}, Y_{jr})$ the gradient is
bounded:

$$\forall F_{i\cdot},\ \exists A_{F,i}, B_{F,i} > 0, \quad E[\hat{q}_\tau(F_{i\cdot})] < A_{F,i} + B_{F,i}\,(F_{i\cdot} - F_{i\cdot}^*)^2.$$

3. Bounded curvature of the Hessian: The eigenvalues of the Hessian $\bar{q}_\tau(F_{i\cdot})$ are
bounded by constants $\lambda_{\max} > \lambda_{\min} > 0$, for all $\tau$ with probability one.

4. Convergence of the Hessian: The perturbation of the sample Hessian from its mean
is bounded. Let $\mathcal{P}_{\tau-1}$ consist of the history of the stochastic Newton iterations: the
data samples and the parameters for the first $\tau - 1$ iterations. Let $g_\tau = o_s(f_\tau)$
denote an almost uniformly bounded stochastic order of magnitude.⁷ The convergence
condition on the Hessian is a concentration of measure statement:

$$E[\bar{q}_\tau \mid \mathcal{P}_{\tau-1}] = \bar{q}_\tau + o_s(1/\tau).$$

⁷ The stochastic o-notation is similar to regular o-notation, except that we are allowed to ignore
measure-zero events.

The objective $\mathcal{L}(F_{i\cdot})$ is strictly convex, due to the $\ell_2$-regularizer, so general convexity
holds over the domain of the objective. Our choice of step length schedule $\eta_\tau = 1/\tau$
is motivated by the decreasing step lengths condition. Because the parameters and data
are uniformly bounded in an interval, any smooth continuous function of these quantities
will also be uniformly bounded. The gradient and Hessian are continuous functions of the
parameters and the data, and so the curvature conditions on the stochastic gradient and
moving average Hessian are met. The $\ell_2$-regularizer guarantees that the sample Hessians
are invertible, and therefore $\bar{q}$ is also invertible, so its eigenvalues are lower bounded away
from zero. The convergence condition on the Hessian motivates Equation 4.17:

$$E[\bar{q}_\tau \mid \mathcal{P}_{\tau-1}] = \left( 1 - \frac{2}{\tau} \right) \bar{q}_{\tau-1} + \frac{2}{\tau}\, E[\hat{q}'_\tau \mid \mathcal{P}_{\tau-1}],$$

since $\mathcal{P}_{\tau-1}$ contains $\bar{q}_{\tau-1}$. Any perturbation from the mean is due to the second term.
We assumed the parameters are uniformly bounded, and so the elements of $E[\hat{q}'_\tau \mid \mathcal{P}_{\tau-1}]$
are uniformly bounded; since this term has bounded elements and is scaled by $2/\tau$, the
perturbation is $o_s(1/\tau)$. □
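
As a brief worked check, added here for illustration, the schedule $\eta_\tau = 1/\tau$ satisfies
both parts of the decreasing step lengths condition:

```latex
\sum_{\tau=1}^{\infty} \eta_\tau = \sum_{\tau=1}^{\infty} \frac{1}{\tau} = \infty
\quad \text{(the harmonic series diverges)}, \qquad
\sum_{\tau=1}^{\infty} \eta_\tau^2 = \sum_{\tau=1}^{\infty} \frac{1}{\tau^2} = \frac{\pi^2}{6} < \infty .
```

The first sum ensures the iterates can travel arbitrarily far from their starting point; the
second ensures the accumulated noise from the sample estimates stays finite.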


The above proof also serves as a proof for the convergence of stochastic gradient
descent: all the conditions for convergence hold by setting the Hessian to the identity:

$$\bar{q}(\cdot) = I_{k \times k}.$$

Theorem 2 provides for convergence of a projection: if we run enough steps of stochastic
Newton within a projection step, the loss is guaranteed to decrease, and the overall
convergence of alternating projections is preserved. In practice, we run one step of stochastic
Newton to update each row factor, and we have not observed any ill-effects for
convergence.⁸


4.5  Experiments 

In this chapter, we focus on the augmented collaborative filtering task: can we improve
predictions for unobserved values of a rating relation Rating(user, movie) $\in \{1, \ldots, 5\}$
using side information about the genres, HasGenre(movie, genre) $\in \{0, 1\}$? Like
many information integration tasks, the relations come from two different data sources,
each providing a view into properties of a shared entity. User ratings are drawn from the
Netflix Prize data [89]. Genres were added from the Internet Movie Database [54]. We
briefly consider a more complicated schema, where in addition to ratings and genres we
include information on which actors are in a particular movie: HasRole(actor, movie).

⁸ It should be noted that while stochastic Newton projection can still yield a convergent algorithm, the
quadratic rate of convergence guarantees for Newton projection do not hold for stochastic Newton
projection. The advantage of stochastic Newton over stochastic gradient is a better sublinear rate of convergence.
However, since we are wrapping the second-order step inside a coordinate descent algorithm, the difference
is moot with respect to rate of convergence guarantees on $\mathcal{L}$.

To yield a harder data set for stochastic optimization, we binarize the ratings:

IsRated(user, movie) $\in \{0, 1\}$.

IsRated tells us whether a user was interested enough in a movie to watch and rate
it. While ratings are sparsely observed, IsRated is densely observed. In schema notation
$\mathcal{E}_1$ corresponds to users, $\mathcal{E}_2$ corresponds to movies, $\mathcal{E}_3$ corresponds to genres, and $\mathcal{E}_4$
corresponds to actors.

The experiments provide evidence for three claims: (i) that integrating different
relationships can improve predictions (see Section 4.5.1); (ii) that stochastic Newton
projections can be effectively used to reduce the cost of training (see Section 4.5.2); and (iii)
that while factor analysis is a more general representation than co-clustering, this does not
imply superior generalization (see Section 4.5.3).


Model  and  Optimization  Parameters 

For consistency, we control many of the model and optimization parameters across the
experiments. When the ratings relation is IsRated, we use a logistic model: sigmoid
link with the matching log-loss. When the ratings are ordinal, we use a Poisson model:
exponential link with the matching loss, unnormalized KL-divergence. In either setting,
we evaluate the test error using mean absolute error (MAE). Since the data for IsRated
is highly imbalanced in favour of movies not being rated, and since positive predictions
are more important than negative ones, we scale the weight of the negative entries down
by the fraction of observed entries where the relation is true. Unless otherwise stated
the regularizers are all $G^*(\cdot) = 10^{-5} \|\cdot\|_F^2 / 2$, normalized to a per-element regularizer,
as suggested in Section 4.3.2. In Newton steps, we use an Armijo line search, rejecting
updates with step length smaller than $\eta = 2^{-4}$. Using Newton steps, we run until the change
in training loss falls below 5% of the objective; using stochastic Newton steps, we run for
a fixed number of iterations.
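
The weighting of negative entries and the evaluation measure can be sketched as follows. This
reflects one reading of the description above, in which positive entries keep weight one and
every negative entry receives a weight equal to the fraction of positive entries; the sketch
also assumes every entry of the binary matrix is observed. The function names are
hypothetical.

```python
import numpy as np

def israted_weights(x):
    """Weight 1 for positive entries; negative entries get the positive fraction."""
    positive_fraction = x.mean()       # fraction of entries where the relation is true
    return np.where(x == 1, 1.0, positive_fraction)

def mean_absolute_error(x, predicted, test_mask):
    """MAE over the held-out entries selected by test_mask."""
    return np.abs(x[test_mask] - predicted[test_mask]).mean()
```

With one positive entry in every four, for example, each negative entry would carry weight
0.25, so the positives and negatives contribute more evenly to the loss.
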

4.5.1 Relations Improve Predictions

Claim: Collective Matrix Factorization can use HasGenre to improve the prediction of
IsRated, and vice-versa.

Our evidence for the claim consists of experiments on two relatively small augmented
collaborative filtering tasks, to allow for repeated trials. Since we are using the three
entity-type model there is a single mixing parameter, $\alpha$, in Equation 4.9. We measure how the
predictive accuracy of Collective Matrix Factorization varies with $\alpha$.
We learn a model for several values of $\alpha$, starting from the same initial random parameters,
using full Newton steps. The performance on a test set, held-out entries sampled from the
matrices according to the test weights, is measured at each $\alpha$. Each trial is repeated ten
times to provide 1-standard deviation bars.

Two scenarios are considered. In the first scenario, users and movies were sampled
uniformly at random; all genres that occur in more than 1% of the movies are retained. We
only use the users' ratings on the sampled movies. In the second scenario, we only sample
users that rated at most 40 movies, which greatly reduces the number of ratings for each
user and each movie. In the first case, the median number of ratings per user is 60 (the
mean, 127); in the second case, the median number of ratings per user is 9 (the mean, 10).
In the first case, the median number of ratings per movie is 9 (the mean, 21); in the second
case, the median number of ratings per movie is 2 (the mean, 8). In the first case we have
$n_1 = 500$ users and $n_2 = 3000$ movies and in the second case we have $n_1 = 750$ users and
$n_2 = 1000$ movies. We use a $k = 20$ embedding dimension for both matrices.

The first rating scenario, Figure 4.2, shows that collective matrix factorization
improves both prediction tasks (whether a user rated a movie, and which genres a movie
belongs to), thus verifying the claim at the beginning of the section. When $\alpha = 1$ the
model uses only rating information; when $\alpha = 0$ it uses only genre information.

In the second rating scenario, Figure 4.3, there is far less information in the ratings
matrix. Half the movies are rated by only one or two users. Because there is so little
information between users, the extra genre information is more valuable. As can be seen
by comparing unaugmented genre prediction ($\alpha = 0$) to intermediate values of $\alpha$, there is
no guarantee that mixing information will improve predictive accuracy.

We hypothesized that adding in the roles of popular actors, in addition to genres, would
further improve performance. By symmetry the update equation for the actor factor is
analogous to the update for the genre factor. Since there are over 100,000 actors in our
data, most of which appear in only one or two movies, we selected 500 popular actors
(those that appeared in more than ten movies). Under a wide variety of settings for the
mixing parameters $\{\alpha^{(12)}, \alpha^{(23)}, \alpha^{(24)}\}$ there was no statistically significant improvement,
regardless of whether IsRated or Rating was used.

Figure 4.2: Test errors (MAE) for predicting whether a movie was rated, and the genre, on
the dense rating example. Panels: (a) Ratings, (b) Genres. Error bars are 2-standard deviations wide.


4.5.2  Stochastic  Approximation 

Claim: On the augmented collaborative filtering data, stochastic Newton projection can be
used to train a model with test error comparable to that of a model trained using Newton
projection, while using substantially less computation.

Our claim regarding stochastic optimization is that it provides an efficient alternative
to Newton updates in the alternating projections algorithm. Since our interest is in the case
with a large number of observed entries in each relation, we use the IsRated relation.
There are $n_1 = 10000$ users, $n_2 = 2000$ movies, and $n_3 = 22$ of the most common genres
in the data set. The mixing coefficient is set close to the optimum value in Figures 4.2(a)
and 4.3(a), namely $\alpha = 0.5$. We set the embedding dimension of both factorizations to
$k = 30$.

On this three factor problem we learn a collective matrix factorization using both
Newton and stochastic Newton methods with batch sizes of 25, 75, and 100 samples per row.
The batch size is larger than the number of genres, and so they are all used. Our primary
concern is sampling the larger user-movie matrix. Using Newton steps, ten cycles of
alternating projection are used; using stochastic Newton steps thirty cycles are used. After
each cycle, we measure the training loss (log-loss) and the test error (mean absolute
error), which are plotted against the CPU time required to reach the given cycle in Figure 4.4.
The error bars are 2-standard deviations, and are computed by repeating the experiment
five times.

Figure 4.3: Test errors (MAE) for predicting whether a movie was rated, and the genre, on
the sparse rating example. Panels: (a) Ratings, (b) Genres. Error bars are 2-standard deviations wide.

Using only a small fraction of the data we achieve results comparable to what full
Newton achieves after five iterations. At batch size 100, we are sampling 1% of the users
and 5% of the movies; yet our performance on test data is the same as a full Newton step
given 8× longer to run. Diminishing returns with respect to batch size suggests that using
very large batches is unnecessary.⁹

It should be noted that Rating is a computationally simpler problem than IsRated.
On a three factor problem with $n_1 = 100000$ users, $n_2 = 5000$ movies, and $n_3 = 21$
genres, with over 1.3M observed ratings, alternating projection with full Newton steps
runs to convergence in 32 minutes on a single 1.6 GHz CPU. We use a small embedding
dimension, $k = 20$, but one can exploit common tricks for large Hessians. We used the
Poisson link for ratings, and the logistic for genres; convergence is typically faster under
the identity link, although predictive performance suffers.


⁹ Even if the batch size were equal to $\max\{n_1, n_2, n_3\}$, stochastic Newton would not return the same
result as full Newton due to the $1/\tau$ damping factor on the sample Hessian.

Figure 4.4: Behaviour of Newton vs. Stochastic Newton on a three-factor model.

4.5.3  Comparison  to  pLSI-pHITS 

Caveat: Collective Matrix Factorization makes no guarantees with respect to
generalization error.

In  this  section  we  provide  an  example  where  the  additional  flexibility  of  collective 
matrix  factorization  leads  to  better  results;  and  another  where  a  co-clustering  model,  pLSI- 
pHITS,  has  the  advantage. 

pLSI-pHITS [24] is a relational clustering technique which is specific to the three
entity-type schema $\mathcal{E}_1 \sim \mathcal{E}_2 \sim \mathcal{E}_3$: $\mathcal{E}_1$ = words and $\mathcal{E}_2 = \mathcal{E}_3$ = documents, but it is
trivial to allow $\mathcal{E}_2 \neq \mathcal{E}_3$.¹⁰ The data matrix $X^{(12)}$ contains frequency of word occurrence
in documents; data matrix $X^{(23)}$ contains indicators of whether or not a document cites
another. The likelihood maximized by pLSI-pHITS is

$$\mathcal{L} = \alpha \left( X^{(12)} \circ \log(UV^T) \right) + (1 - \alpha) \left( X^{(23)} \circ \log(VZ^T) \right), \tag{4.19}$$

where the parameters $U$, $V$, and $Z$ contain probabilities:

$$u_{i\ell} = p(x_i^{(1)} \mid h_\ell), \quad v_{i\ell} = p(h_\ell \mid x_i^{(2)}), \quad z_{i\ell} = p(x_i^{(3)} \mid h_\ell),$$

for clusters $\{h_1, \ldots, h_k\}$. Probability constraints require that each column of $U$, $V^T$, and
$Z$ must sum to one, which induces a clustering of entities. Since different entities can
participate in different numbers of relations (e.g., some words are more common than others)
the data matrices $X^{(12)}$ and $X^{(23)}$ are usually normalized; we can encode this
normalization using weight matrices. The objective, Equation 4.19, is the weighted average of
two probabilistic LSI [52] models with shared latent factors $h_k$. pLSI is one of the single
matrix models we describe in Table 3.1.

¹⁰ We do not require the citing and cited documents be the same. The two sets, in general, need not be the
same. Moreover, the words used in the citing documents need not cover the words in the cited documents.

We sample two instances of IsRated, controlling for the number of ratings each
movie has. The sampling is as follows: sample a chosen number of users, among those
users who have rated at least one movie. Then, retain only the movies rated by at least
one selected user. We created a dense data set by sampling 1000 users. In the dense data
set, the median number of ratings per movie (user) is 11 (76); in the sparse data set, the
median number of ratings per movie (user) is 2 (4). In both cases there are 1000 randomly
selected users, and 4975 randomly selected movies, all the movies in the dense data set.

Since  pLSI-pHITS  is  a  co-clustering  method,  and  our  collective  matrix  factorization 
model  is  a  relation  prediction  method,  we  choose  a  measure  that  favours  neither  inher¬ 
ently:  ranking.  We  induce  a  ranking  of  movies  for  each  user,  measuring  the  quality  of 
the  ranking  using  mean  average  precision  [47]:  queries  correspond  to  user’s  requests  for 
ratings,  “relevant”  items  are  the  movies  of  the  held-out  links,  we  use  only  the  top  200 
movies  in  each  ranking11,  and  the  averaging  is  over  users.  Most  movies  are  unrated  by 
any  given  user,  and  so  relevance  is  available  only  for  a  fraction  of  the  items:  the  absolute 
mean  average  precision  values  will  be  small,  but  relative  differences  are  meaningful.  We 
compare  four  different  models  for  generating  rankings  of  movies  for  users: 

>  CMF-Identity:  Collective  matrix  factorization  using  identity  prediction  links,  f\  {())  = 
f2(0)  =  6  and  squared  loss.  Full  Newton  steps  are  used.  The  regularization  and  opti¬ 
mization  parameters  are  the  same  as  those  described  in  Section  4.5,  except  that  the 
smallest  step  length  is  77  =  2~5.  The  ranking  of  movies  for  user  i  is  induced  by  Ui.VT . 

>  CMF-LOGISTIC:  Like  CMF-Identity,  except  that  the  matching  link  and  loss  correspond 
to  a  Bernoulli  distribution,  as  in  logistic  regression:  fi{9)  =  f2(0)  =  1  /  (1  +  exp-0). 

>  pLSI-pHITS:  Described  above.  We  give  the  regularization  advantage  to  pLSI-pHITS: 
the  amount  of  regularization  f3  G  [0, 1]  is  chosen  at  each  iteration  using  tempered  EM. 
The  smaller  f3  is,  the  stronger  the  parameter  smoothing  towards  the  uniform  distribution. 
We  are  also  more  careful  about  setting  (5  than  Cohn  et.  al.  [24],  using  a  decay  rate  of 
0.95  and  minimum  (3  of  0.7.  To  have  a  consistent  interpretation  of  iterations  between 

11  The  relations  between  the  curves  in  Figure  4.5  are  the  same  if  the  rankings  are  not  truncated. 
[Figure 4.5 panels: (a) Dense, (b) Sparse]


Figure 4.5: Ranking movies for users on a data set where each movie has many ratings
(dense) or only a handful (sparse). The methods are described in Section 4.5.3. Error bars
are one standard deviation wide.


this method and CMF, we use tempering to choose the amount of regularization, and
then fit the parameters from a random starting point with the best choice of $\beta$. Movie
rankings are generated using the predicted $p(\text{movie} \mid \text{user}) = p(x_j^{(2)} \mid x_i^{(1)})$.

>  Pop:  A  baseline  method  that  ignores  the  genre  information.  It  generates  a  single  ranking 
of  movies,  in  order  of  how  frequently  they  are  rated,  for  all  users. 
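
The evaluation metric described before the list of methods, mean average precision over truncated rankings, can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the experiments: the function names, the toy data, and the convention of normalizing by the number of relevant items are assumptions made here.

```python
def average_precision(ranking, relevant, cutoff=200):
    """Average precision of one ranked list, using only the top `cutoff` items."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranking[:cutoff], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumption: normalise by the number of relevant (held-out) items.
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(rankings, held_out, cutoff=200):
    """rankings: user -> ranked list of movies; held_out: user -> set of relevant movies."""
    scores = [average_precision(rankings[u], held_out[u], cutoff) for u in rankings]
    return sum(scores) / len(scores)


# Toy example with two users (hypothetical data).
rankings = {"u1": ["a", "b", "c", "d"], "u2": ["c", "a", "b", "d"]}
held_out = {"u1": {"a", "c"}, "u2": {"b"}}
map_score = mean_average_precision(rankings, held_out)
print(map_score)
```

Because most movies are unrated by any given user, the held-out sets are small relative to the ranking, which is why absolute values of this score tend to be small.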

In each case the models, save popularity ranking, have embedding dimension $k = 30$
and run for at most 10 iterations. We compare on a variety of values of $\alpha$, but we make
no claim that mixing information improves the quality of rankings. Since $\alpha$ is a free
parameter we want to confirm the relative performance of these methods at several values.
In Figure 4.5, collective matrix factorization significantly outperforms pLSI-pHITS on the
dense  data  set;  the  converse  is  true  on  the  sparse  data  set.  Ratings  do  not  benefit  from 
mixing  information  in  any  of  the  approaches,  on  either  data  set.12  While  the  flexibility  of 
collective  matrix  factorization  has  its  advantages,  especially  computational  ones,  we  do 
not  claim  unequivocal  superiority  over  relational  models  based  on  matrix  co-clustering. 


12We note that, using a cluster quality criterion, on a different data set, Cohn and Hofmann [24] were able
to show that mixing improves prediction on pLSI-pHITS.




Chapter  5 

Hierarchical  Bayesian  Collective  Matrix 
Factorization 

5.1  Introduction 


In  the  previous  chapter,  we  defined  a  fully  probabilistic  model  for  sets  of  related  matrices: 
Collective  Matrix  Factorization.  The  basic  approach  involves  simultaneously  finding  a 
low-rank representation for a set of related matrices, where shared dimensions in the matrices
correspond to shared low-rank factors. We reduced the parameter estimation problem
to finding the posterior mode of the parameters, which is referred to as regularized
maximum  likelihood  or  maximum  a  posteriori  inference. 

Collective Matrix Factorization inherits both the benefits and some of the limitations
of maximum likelihood single matrix factorization: (i) the model is a point estimate that
ignores uncertainty in the posterior distribution over parameters. Since data matrices are
often sparse, and the standard consistency arguments in favour of maximum likelihood do
not apply, the posterior may assign significant mass to many models; (ii) strong generalization
(prediction involving new entities) is not statistically well defined (Section 3.9); (iii)
the Gaussian prior over factor rows, proposed by Gordon [45] for E-PCA, requires that the
value of the hyperparameters $\Theta$ be assigned before parameter estimation can begin. If we
want to go beyond the naive zero-mean assumption, we are forced to search over at least
a $(k + 1)$-dimensional space for $\Theta$ (i.e., $k$ parameters for the mean and one for the scale
of a spherical covariance matrix). Each point in the grid search over hyperparameter values
involves learning a Collective Matrix Factorization and evaluating it on held-out data,
which is computationally expensive.

A  subtle  criticism  of  Collective  Matrix  Factorization  is  that  information  cannot  be 
easily pooled across rows, or across columns. Since the prior distribution over factor rows
is fixed, the only way two rows of a factor can influence each other is indirectly: e.g., two
rows $U_{i\cdot}$ and $U_{j\cdot}$, $i \neq j$, can only influence each other through their effect on $V$. Learning
the  prior  automatically,  through  a  hierarchical  prior,  allows  for  information  to  be  pooled 
across  the  rows  of  a  factor. 

A subtler criticism of Collective Matrix Factorization centers on what sort of information
is transferred. Take the three entity-type example, where the posterior distribution
is $p(U, V, Z \mid X, Y, \Theta)$. Consider the analogous posterior involving only the $X$ data
matrix: $p(U, V \mid X, \Theta_U, \Theta_V)$. In each case we can compute the marginal distribution of
an element of a factor, say $U_{ij} \in U$. If we compare the posterior distribution over $U_{ij}$
in the single and two-matrix cases, it is clear that the two distributions $p(U_{ij} \mid X, Y, \Theta)$
and $p(U_{ij} \mid X, \Theta_U, \Theta_V)$ can, and usually will, differ. Under maximum a posteriori inference,
the only difference we see between the two distributions is a difference in the mode:
changes  in  the  variance,  skew,  and  other  properties  of  the  distribution  are  not  accounted 
for.  The  only  way  Y  can  affect  the  prediction  of  entries  in  X  is  in  the  effect  it  has  on  V  and 
U.  The  only  way  X  can  affect  the  prediction  of  entries  in  Y  is  in  the  effect  it  has  on  V  and 
Z.  Changes  in  the  posterior  mode  account  for  some  of  the  effect  of  information  sharing 
between  matrices;  changes  in  the  posterior  distribution  account  for  all  of  the  effect. 

A  motivating  example.  Functional  Magnetic  Resonance  Imaging  (fMRI)  is  often  used 
to  measure  responses  in  small  regions  of  the  brain  (i.e.,  voxels)  given  external  stimuli. 
Given  enough  experiments  on  a  sufficiently  broad  range  of  stimuli,  one  can  build  models 
that  predict  patterns  of  brain  activation  given  new  stimuli  [83].  Running  enough  subjects 
through  an  fMRI  scanner,  on  a  wide  variety  of  stimuli,  is  costly.  However,  we  can  often 
collect  cheap  side  information  about  the  stimuli.  In  this  chapter,  we  consider  an  experiment 
where  the  stimulus  is  a  word-picture  pair  displayed  on  a  screen.  We  can  collect  statistics 
of  whether  the  stimulus  word  co-occurs  with  other  commonly  used  words  in  large,  freely 
available  text  corpora.  By  integrating  the  two  sources  of  information,  related  through  the 
shared  set  of  stimulus  words,  we  hope  to  improve  the  quality  of  predictive  models  of  brain 
activation. 

Main Contributions:  In this chapter, we propose a hierarchical Bayesian variant of Collective
Matrix Factorization, which addresses the four criticisms of the maximum likelihood
approach, described above. We design a Metropolis-Hastings algorithm for sampling
from the posterior by cyclically sampling from Bayesian Generalized Linear Models.
Were we to use a naive random-walk proposal in a factored Metropolis-Hastings, the algorithm
would be a standard block MCMC sampler; the Markov Chain would also be
slow to mix (i.e., slow training). Using what we already know about efficiently computing
the gradient and per-row Hessians of the regularized likelihood, we develop an adaptive

Figure 5.1: Plate representations of collective matrix factorization (Chapter 4) and its
analogue with hierarchical priors (Section 5.2). Shaded nodes indicate observed data or
quantities which must be selected. Dashed nodes are fixed parameters. Weight matrices
$W$ and $\tilde{W}$ are elided.


proposal  distribution  for  Metropolis-Hastings.  The  resulting  algorithm  (i)  avoids  the  need 
for  tedious  hand  tuning  of  a  large  number  of  proposal  distributions;  (ii)  is  fast  enough  to 
handle  >20,000  entities  in  a  few  minutes.  Unlike  many  comparable  techniques,  we  make 
no  limiting  conjugacy  assumptions  on  our  model:  every  maximum  likelihood  Collective 
Matrix  Factorization  has  a  Bayesian  analogue,  which  we  can  train  using  our  adaptive  block 
Metropolis-Hastings  sampler.  In  our  experiments,  we  show  that  this  approach  can  be  used 
to augment costly fMRI data with inexpensive word co-occurrence data, improving predictions
of brain activity even when the voxel being tested never appeared in the training
set. 


5.2  Hierarchical  Collective  Matrix  Factorization 

In this section, we extend Collective Matrix Factorization (Figure 5.1(a)) to include
hierarchical priors (Figure 5.1(b)).1

1 We remind the reader that $U_{i\cdot}$, $V_{j\cdot}$, and $Z_{r\cdot}$ are factor rows, $k$-dimensional vectors of random variables;
and that $\mu$ is a vector of normal mean parameters, and $\Sigma$ a covariance matrix.
Collective Matrix Factorization requires choosing a good fixed value for the hyperparameters
$\Theta = \{\Theta_U, \Theta_V, \Theta_Z\}$, where $\forall F \in \{U, V, Z\}$, $\Theta_F = \{\mu_F, \Sigma_F\}$. Finding a good
value for $\Theta$ is a difficult task: even if we assume that the prior means are zero and the
prior covariances spherical, cross-validation involves searching over the scales of the three
covariance matrices.

Another  issue  with  Collective  Matrix  Factorization  is  that  information  between  rows 
of  a  factor  can  only  be  shared  indirectly,  through  another  factor.  Consider  the  plate  model 
in Figure 5.1(a). Using d-separation [94], it is easy to deduce that if only one factor is
free (say $U$) then the rows of that factor are independent of one another. Independence
between  factor  rows  allows  us  to  reduce  the  projection  over  a  large  factor  matrix  into 
parallel  optimizations  over  each  row  of  that  factor.  Computationally,  the  independence  of 
rows  in  the  free  factor  is  useful.  Statistically,  we  want  a  more  direct  way  of  pooling  shared 
behaviour  across  rows  or  columns  of  a  matrix. 

Since we do not usually have any reason to know the value of the Gaussian hyperparameters
$\Theta$, a sensible approach is to treat them as quantities to be learned, placing an
appropriate weak prior on them: i.e., a hierarchical model. Hierarchical priors address both
the problem of selecting $\Theta$, and the desire for pooling information within a matrix. Moreover,
we can design the hierarchical priors so that independence of factor rows is preserved
under alternating projections.

The natural approach is to place hierarchical priors on $\Theta_U$, $\Theta_V$, and $\Theta_Z$ separately.
This approach makes it easy to train the resulting model under alternating projections.
Moreover, our approach follows naturally from the nature of row-column exchangeable
matrices (Section 3.5.2). Any row-column exchangeable matrix can be viewed either as
a collection of exchangeable rows, or as a collection of exchangeable columns. If we
consider the matrix factorization $X \approx f(UV^T)$, exchangeable rows justify the use of a
hierarchical prior on $U$; exchangeable columns justify the use of a hierarchical prior on
$V$.2

We elect to use the conjugate prior for $(\mu, \Sigma)$: the normal-inverse-Wishart distribution [38].
The normal-inverse-Wishart prior on $\Theta_F = \{\mu_F, \Sigma_F\}$ is defined by first sampling
the inverse covariance from a Wishart distribution, $\mathcal{W}$, then conditionally sampling the mean
from a Gaussian distribution, $\mathcal{N}$:


$$\Sigma_F^{-1} \sim \mathcal{W}(\nu_F, \Psi_F), \qquad \mu_F \mid \Sigma_F \sim \mathcal{N}(\xi_F, \Sigma_F / \beta_0).$$


2de Finetti's representation theorem for infinitely exchangeable sequences of random variables is commonly
used to motivate hierarchical priors (see Gelman et al. [38], Chapter 5).
The fixed hyperprior parameters $\nu_F > k$, $\Psi_F \in \mathbb{R}^{k \times k}$, $\xi_F \in \mathbb{R}^k$, and $\beta_0 > 0$ are chosen
by the modeler. Our choice of hierarchical priors for matrix factorization mirrors that of
Salakhutdinov  and  Mnih  [108],  though  we  do  not  make  the  simplifying  assumption  that 
the  likelihood  is  conjugate  to  the  hierarchical  prior.  Choosing  other  priors  that  exploit 
sparsity,  block  structure,  or  other  properties  remains  a  topic  of  future  work.3 

In  addition  to  treating  things  that  are  unknown  as  unknown,  the  hierarchical  prior  acts 
as  a  shrinkage  estimator  for  the  rows  of  a  factor.  Shrinkage  may  be  especially  useful 
when  some  of  the  entities  are  associated  with  only  a  few  observations:  in  the  absence 
of  much  data  we  let  the  low-rank  representation  of  an  entity  tend  towards  the  population 
mean. Shrinkage estimators pool information about rows or columns of a matrix, indirectly,
through the latent hyperparameters.

It  is  not  hard  to  generalize  our  maximum  likelihood  decomposition  approach  to  the 
hierarchical  case  (Algorithm  2).  Simply  alternate  between  optimizing  the  parameters  given 
fixed  hyperparameters  (Section  4.3.1),  and  optimizing  the  hyperparameters  given  fixed 
parameters. Given fixed $F \in \mathcal{F}$, the most probable value of $(\mu_F, \Sigma_F)$ is the mode of a
normal-inverse-Wishart distribution with the following parameters:


(5.1) 
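
For orientation, the standard conjugate update for a normal-inverse-Wishart prior is sketched below. This is the textbook form (see, e.g., Gelman et al. [38]), written under the assumption that the Wishart is placed on the inverse covariance with scale matrix $\Psi_F$; it is not a transcription of the thesis's Equation 5.1, and the subscript-$n$ symbols are introduced here only for illustration. Let $F$ have $n$ rows $F_{1\cdot}, \ldots, F_{n\cdot}$, with sample mean $\bar{F}$ and scatter matrix $S$.

```latex
\begin{align*}
\bar{F} &= \frac{1}{n} \sum_{i=1}^{n} F_{i\cdot}, &
S &= \sum_{i=1}^{n} (F_{i\cdot} - \bar{F})^T (F_{i\cdot} - \bar{F}), \\
\beta_n &= \beta_0 + n, &
\nu_n &= \nu_F + n, \\
\xi_n &= \frac{\beta_0 \xi_F + n \bar{F}}{\beta_0 + n}, &
\Psi_n^{-1} &= \Psi_F^{-1} + S + \frac{\beta_0 n}{\beta_0 + n} (\bar{F} - \xi_F)^T (\bar{F} - \xi_F).
\end{align*}
```

Here factor rows are treated as row vectors, so each outer product is a $k \times k$ matrix. Under these assumptions the conditional distribution of $(\mu_F, \Sigma_F)$ given $F$ is again normal-inverse-Wishart, with $(\nu_F, \Psi_F, \xi_F, \beta_0)$ replaced by $(\nu_n, \Psi_n, \xi_n, \beta_n)$.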


As  discussed  earlier  in  this  thesis,  this  decomposition  approach  is  known  as  alternating 
projections  in  the  maximum  likelihood  case.  Below,  in  Section  5.3.1,  we  use  a  similar 
decomposition  of  the  model  parameters  to  develop  a  block  Metropolis-Hastings  sampler. 

To  derive  a  hierarchical  Bayesian  extension  to  Collective  Matrix  Factorization,  we  first 
consider  the  hierarchical  prior,  whose  addition  to  the  plate  representation  of  Collective 
Matrix  Factorization  is  depicted  in  Figure  5.1(b).  In  the  hierarchical  plate  model,  there  are 
several  quantities  to  consider: 

>  The data variables, $\mathcal{D} = \{X, Y\}$.

>  The data weights on $X$ and $Y$: $W$ and $\tilde{W}$, respectively.

>  The parameters $\mathcal{F} = \{U, V, Z\}$.

3While we have focused on $\ell_2$-regularization, all variants of Collective Matrix Factorization work with
any prior whose probability density function is twice-differentiable and decomposable.
>  The hyperparameters, parameters of the prior on $\mathcal{F}$: $\Theta = \{\mu_U, \Sigma_U, \mu_V, \Sigma_V, \mu_Z, \Sigma_Z\}$.

>  Hyperprior parameters, the parameters of the prior on $\Theta$:

$\Theta_0 = \{\nu_U, \Psi_U, \xi_U, \nu_V, \Psi_V, \xi_V, \nu_Z, \Psi_Z, \xi_Z, \beta_0\}$.

$\Theta_0$ is fixed, chosen by the user. In all our experiments, for each factor $F$, $\nu_F = k$ is the
embedding dimension, $\Psi_F$ is a $k \times k$ identity matrix, $\xi_F = 0$, and $\beta_0 = 1$. This choice
of hyperprior parameters encodes a very weak preference for the prior over factor rows
to be a $k$-dimensional Gaussian with mean 0 and identity covariance. Since this is a
prior over factor rows, and the number of rows depends on the size of the data matrices,
our choice of $\Theta_0$ becomes less consequential as $X$ and $Y$ get larger.

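Factoring the posterior distribution over $\{\mathcal{F}, \Theta\}$ according to the hierarchical model in
Figure 5.1(b):

$$p(\mathcal{F}, \Theta \mid X, Y, W, \tilde{W}, \Theta_0) = p(U, V \mid X, W)\, p(V, Z \mid Y, \tilde{W}) \prod_{F \in \{U, V, Z\}} p(F \mid \Theta_F)\, p(\Theta_F \mid \Theta_0). \tag{5.2}$$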

We shall often need to work in log-space, where the logarithm of Equation 5.2 is

$$\mathcal{O} = \mathcal{L} + \sum_{F \in \mathcal{F}} \log p(\Theta_F \mid \Theta_0). \tag{5.3}$$

The objective for non-hierarchical Collective Matrix Factorization, $\mathcal{L}$, is defined in Equation 4.9.
The distribution over $\Theta_F$ is the normal-inverse-Wishart, defined above. Given
that we have already derived the gradient and Hessian of $\mathcal{L}$, deriving the equations for the
gradient and Hessian of $\mathcal{O}$ is straightforward.

5.2.1  Generalization  to  an  Arbitrary  Number  of  Relations 

As with the previous chapter, we focus on the two matrix case. However, Algorithms 2-4
are presented for the general case, which can involve any number of related matrices.
Here, we remind the reader of the general notation. Entity-types, the different kinds of
relation arguments, are indexed by $i = 1 \ldots t$. The number of entities of type $i$ in the
training set is denoted $n_i$. A matrix corresponding to a relation between entity-types $i$ and
$j$ is denoted $X^{(ij)}$. Each relation matrix is represented as the product of low-rank factors:

$$X^{(ij)} \approx f^{(ij)}\left(U^{(i)} (U^{(j)})^T\right),$$

where $f^{(ij)}$ is the element-wise link function that transforms the low-rank latent representation
into predictions. Each factor $U^{(i)}$ has its own Gaussian prior, defined over factor
rows. In the hierarchical case, each of these priors is assigned a normal-inverse-Wishart
hyperprior.
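
The general notation can be made concrete with a small sketch: one factor per entity-type, and one element-wise link per relation. The dictionary layout, the function names, and the random factors below are illustrative assumptions, not part of the thesis's algorithms.

```python
import numpy as np


def identity(theta):
    """Identity link, matching squared loss."""
    return theta


def logistic(theta):
    """Logistic link, matching a Bernoulli response."""
    return 1.0 / (1.0 + np.exp(-theta))


def predict_relations(factors, links):
    """factors: entity-type i -> (n_i x k) array U^(i).
    links: relation (i, j) -> element-wise link f^(ij).
    Returns (i, j) -> f^(ij)(U^(i) U^(j)^T)."""
    return {(i, j): f(factors[i] @ factors[j].T) for (i, j), f in links.items()}


rng = np.random.default_rng(0)
k = 4  # embedding dimension shared by all factors
factors = {
    1: rng.normal(size=(5, k)),
    2: rng.normal(size=(6, k)),
    3: rng.normal(size=(3, k)),
}
# Entity-type 2 participates in both relations, so its factor is shared.
links = {(1, 2): identity, (2, 3): logistic}
predictions = predict_relations(factors, links)
```

The shared factor for entity-type 2 is what couples the two factorizations: any change to it alters the predictions for both relations.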


5.3  Bayesian  Inference  for  the  Hierarchical  Model 


A  limitation  of  both  collective  matrix  factorization  and  its  hierarchical  extension  is  that 
only  point  estimates  are  possible.  There  may  be  substantial  uncertainty  in  the  parameter 
estimates,  especially  when  each  entity  only  participates  in  a  few  relationships.  Failing  to 
model  this  uncertainty  often  leads  to  worse  accuracy  at  prediction.  Unlike  most  debates 
on  the  merits  of  maximum  likelihood  vs.  Bayesian  estimation,  the  former  cannot  invoke 
asymptotic  consistency  here:  even  in  the  limit  of  infinite  data,  the  number  of  parameters 
grows  with  the  number  of  entities. 

Another  limitation  of  maximum  likelihood  Collective  Matrix  Factorization  is  that  it  is 
of  limited  value  in  predicting  behaviour  of  a  new  entity,  one  not  present  in  the  training 
data.  The  standard  approach  to  “folding-in”  a  new  row  in  a  data  matrix  involves  finding 
the  maximum  likelihood  solution  for  the  low-dimensional  representation  of  the  new  row, 
fixing  the  previously  trained  parameters  and  hyperparameters  (Section  5.3.3).  The  bilinear 
nature  of  the  model  makes  fixing  one  factor  to  its  maximum  likelihood  value  problematic. 
Consider folding-in a new row of a data matrix, $X_{\text{new},\cdot}$. A particular value of $V$ defines
a distribution over the $k$ factors for each column of $X$. If we fit $U_{\text{new},\cdot}$ conditioned on a
particular fixed value of $V$, then we are assuming that the chosen value of $V$ is the true
distribution over factors for each column entity. In practice, there may be many plausible
values of $V$ that differ enough to affect the predictions $\hat{X}_{\text{new},\cdot} = f_1(U_{\text{new},\cdot} V^T)$. Fixing $V$
to a single value ignores the possibility of other plausible models that may better fit the
behaviour of a particular row entity.4

To address these concerns, we propose a fully Bayesian approach to training the hierarchical
model introduced in Section 5.2. Instead of approximating the posterior in
Equation 5.2 by its mode, we approximate it using multiple samples drawn from it.


4This argument is akin to the one that is used to explain why Latent Dirichlet Allocation is better at
prediction on new documents than pLSI. The latter assumes a fixed distribution over topics; the former
integrates over the topic simplex, a distribution over distributions of topics.
5.3.1  Block Metropolis-Hastings


In finding the maximum a posteriori solution for hierarchical CMF, we exploited structure
in the large number of unknowns $\{\mathcal{F}, \Theta\}$ to make the optimization tractable. First we
alternated between optimizing over $\mathcal{F}$ and $\Theta$. When optimizing over $\mathcal{F}$, we alternated
between optimizing each factor $F \in \mathcal{F}$. When optimizing over a factor $F$, we reduced the
computation to parallel updates over each row of $F$, and so the hardest part of inference
is optimizing a convex function over $k \ll \min\{m, n, r\}$ variables. In Newton-projection
over a factor row, $F_{i\cdot}$, the core operation involves finding a step towards the most likely
value of $F_{i\cdot}$, given that the other parameters are fixed.

In a block Metropolis-Hastings sampler (Section 2.5.1) we might instead consider sampling
from the conditional distribution of $F_{i\cdot}$, given that the other parameters are fixed. If
it is feasible to sample from this distribution, or to sample from an approximation of this
distribution, we have arrived at a block Markov Chain Monte Carlo sampler.

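Let $\Omega = \{\mathcal{F}, \Theta\}$ denote the parameters we wish to sample over. We choose the following
conditional sampling distributions for block Metropolis-Hastings5,


$$\forall F \in \{U, V, Z\},\ \forall i = 1 \ldots n_F: \quad F_{i\cdot} \sim p(F_{i\cdot} \mid \Omega - F_{i\cdot}), \tag{5.4}$$

$$\forall F \in \{U, V, Z\}: \quad \Theta_F \sim p(\Theta_F \mid \Omega - \Theta_F). \tag{5.5}$$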


Sampling directly from the distribution in Equation 5.5 is straightforward: the distribution
of $\Theta_F$ is normal-inverse-Wishart, so the conditional sampling distribution has a closed
form (Section 5.2). In contrast, the form of the conditional sampling distribution in Equation 5.4
is unknown, unless we assume that $X$ and $Y$ are Gaussian: i.e., the data distributions
and factor priors are mutually conjugate.
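
As an illustration of why sampling from Equation 5.5 is straightforward, the sketch below draws one value of $(\mu_F, \Sigma_F)$ given the rows of a factor, using the textbook normal-inverse-Wishart conjugate update. It assumes the Wishart is placed on the inverse covariance with scale matrix `Psi0`, and that the degrees of freedom are an integer so that a Wishart draw can be formed from Gaussian outer products; the function name and the toy data are hypothetical.

```python
import numpy as np


def sample_hyperparameters(F, nu0, Psi0, xi0, beta0, rng):
    """One draw of (mu_F, Sigma_F) from its conditional given factor rows F (n x k)."""
    n, k = F.shape
    mean = F.mean(axis=0)
    centered = F - mean
    S = centered.T @ centered                      # scatter matrix of the rows
    beta_n = beta0 + n
    nu_n = nu0 + n                                 # integer degrees of freedom
    xi_n = (beta0 * xi0 + n * mean) / beta_n
    d = (mean - xi0)[:, None]
    Psi_n_inv = np.linalg.inv(Psi0) + S + (beta0 * n / beta_n) * (d @ d.T)
    Psi_n = np.linalg.inv(Psi_n_inv)
    Psi_n = (Psi_n + Psi_n.T) / 2.0                # remove round-off asymmetry
    # Wishart(nu_n, Psi_n) draw: G^T G, where the nu_n rows of G are N(0, Psi_n).
    G = rng.multivariate_normal(np.zeros(k), Psi_n, size=nu_n)
    precision = G.T @ G
    Sigma = np.linalg.inv(precision)
    Sigma = (Sigma + Sigma.T) / 2.0
    mu = rng.multivariate_normal(xi_n, Sigma / beta_n)
    return mu, Sigma


rng = np.random.default_rng(0)
k = 3
# Toy factor rows (hypothetical): 500 rows centred at 2 with standard deviation 0.5.
F = rng.normal(loc=2.0, scale=0.5, size=(500, k))
mu, Sigma = sample_hyperparameters(F, nu0=k, Psi0=np.eye(k), xi0=np.zeros(k), beta0=1.0, rng=rng)
```

With many rows the draw concentrates near the empirical mean and covariance of the rows, which is the sense in which the choice of hyperprior parameters matters less as the data matrices grow.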

5.3.2  Hessian  Metropolis-Hastings 

If  we  assume  that  the  elements  of  X  and  Y  are  Gaussian,  then  Metropolis-Hastings  can 
be  reduced  to  Gibbs  sampling,  leading  to  a  multiple-matrix  generalization  of  Bayesian 
Probabilistic  Matrix  Factorization  [109]. 

However, in almost all the examples of information integration we consider, the entries
of different data matrices have different response types: e.g., in the fMRI domain,
word co-occurrence is binary, but voxel activation is continuous. We do not wish to sacrifice
the flexibility of our model, which supports multiple response types, to reap the
5We  note  that  the  way  parameters  are  grouped  in  block  Metropolis-Hastings  is  similar  to  the  way  they 
are  grouped  in  alternating  Newton  projections. 
benefits of Bayesian inference. Therefore, we consider more general Metropolis-Hastings
approaches.

In Metropolis-Hastings, one often resorts to sampling from a Gaussian proposal distribution
whose mean is the sample at time $t$, $F_{i\cdot}^{(t)}$, with covariance matrix $v_i \cdot I$. This forces
the user to choose $v_i$ for each row in a way that causes the underlying Markov chain to
mix quickly. Tuning one proposal distribution is tedious; tuning a proposal distribution for
each entity is masochistic. If the Hessian of the target distribution, with respect to $F_{i\cdot}^{(t)}$, is
far from spherical, then the rate at which the underlying Markov chain mixes can be slow,
regardless of how well $v_i$ is tuned.

The distribution in Equation 5.4 may not be easy to sample from; but given a point,
namely $F_{i\cdot}^{(t)}$, we can easily compute the local gradient and Hessian of the distribution with
respect to $F_{i\cdot}$. By using the gradient and Hessian, we can create a proposal distribution
that better approximates $p(F_{i\cdot} \mid \Omega - F_{i\cdot})$.

In the case of Metropolis-Hastings for a Bayesian Generalized Linear Model, one technique
that uses both the gradient and Hessian to dynamically construct the proposal is Hessian
Metropolis-Hastings (HMH) [98]. To define a Metropolis-Hastings sampler, we need
to define the forward sampling distribution, from which a proposal value $F_{i\cdot}^{(*)}$ is drawn
given the previous value in the chain, $F_{i\cdot}^{(t)}$. The choice of a forward sampling distribution
leads to a corresponding backward sampling distribution, which defines the probability of
returning to $F_{i\cdot}^{(t)}$ from the proposal value $F_{i\cdot}^{(*)}$. The forward sampling distribution in HMH
is a Gaussian whose mean is determined by taking one Newton step from $F_{i\cdot}^{(t)}$, and whose
covariance is derived from the Hessian used to take the Newton step. The backward sampling
distribution in HMH is a Gaussian whose mean is determined by taking one Newton
step from $F_{i\cdot}^{(*)}$, and whose covariance is derived from the Hessian used to take the Newton
step. Instead of searching over step lengths $\eta$, we sample from a fixed distribution
over step lengths (here, uniformly at random). Algorithms 3 and 4 describe the Hessian
Metropolis-Hastings sampler for Equation 5.4.
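
The forward and backward proposals can be illustrated with a minimal sketch of one Hessian Metropolis-Hastings update for a single factor row. It is not a transcription of Algorithms 3 and 4. It assumes a logistic link with Bernoulli data, a Gaussian prior with known mean `mu` and precision `prec`, the inverse negative Hessian as proposal covariance, and the same uniformly drawn step length `eta` for the forward and backward proposals; all function names are hypothetical.

```python
import numpy as np


def log_target(u, V, x, mu, prec):
    """Unnormalised log-density of a factor row u: Bernoulli likelihood plus Gaussian prior."""
    theta = V @ u
    diff = u - mu
    return np.sum(x * theta - np.logaddexp(0.0, theta)) - 0.5 * diff @ prec @ diff


def newton_proposal(u, V, x, mu, prec, eta):
    """Mean and covariance of the Gaussian proposal constructed at the point u."""
    p = 1.0 / (1.0 + np.exp(-(V @ u)))
    grad = V.T @ (x - p) - prec @ (u - mu)
    neg_hess = V.T @ (V * (p * (1.0 - p))[:, None]) + prec  # positive definite
    cov = np.linalg.inv(neg_hess)
    cov = (cov + cov.T) / 2.0
    return u + eta * cov @ grad, cov                         # one Newton step of length eta


def log_gauss(z, mean, cov):
    """Gaussian log-density up to an additive constant that cancels in the ratio."""
    diff = z - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (diff @ np.linalg.solve(cov, diff) + logdet)


def hmh_step(u, V, x, mu, prec, rng):
    """One accept/reject update; returns the new state and whether the proposal was accepted."""
    eta = rng.uniform(0.0, 1.0)
    fwd_mean, fwd_cov = newton_proposal(u, V, x, mu, prec, eta)
    u_new = rng.multivariate_normal(fwd_mean, fwd_cov)
    bwd_mean, bwd_cov = newton_proposal(u_new, V, x, mu, prec, eta)
    log_accept = (log_target(u_new, V, x, mu, prec) + log_gauss(u, bwd_mean, bwd_cov)
                  - log_target(u, V, x, mu, prec) - log_gauss(u_new, fwd_mean, fwd_cov))
    if np.log(rng.uniform()) < log_accept:
        return u_new, True
    return u, False


rng = np.random.default_rng(0)
k, n_cols = 3, 40
V = rng.normal(size=(n_cols, k))                       # fixed column factor (toy data)
x = (rng.uniform(size=n_cols) < 0.5).astype(float)     # one binary data row (toy data)
mu, prec = np.zeros(k), np.eye(k)
u, accepted, draws = np.zeros(k), 0, []
for _ in range(500):
    u, ok = hmh_step(u, V, x, mu, prec, rng)
    accepted += ok
    draws.append(u)
draws = np.array(draws)
```

Because the forward and backward covariances are evaluated at different points, their log-determinants do not cancel and must be kept in the acceptance ratio. No per-row proposal variance has to be tuned by hand, since the Hessian supplies the shape of the proposal.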

Random-walk Metropolis-Hastings takes a dim view of what one can do with a sampling
distribution: it assumes only that we can efficiently evaluate probabilities, up to a
normalizing constant. If we assume we know nothing about the structure of the sampling
distribution, then there is little we can do with the Gaussian forward and backward sampling
distributions other than pick a distribution with appropriate variance, centered at the
required point. However, since we have an efficient Newton-projection for finding the mode
of $p(F_{i\cdot} \mid \Omega - F_{i\cdot})$, we should use it to pick a proposal that is closer to the mode. After all,
for any smooth distribution on $F_{i\cdot}$, the mode is in a region of high probability.
5.3.3  Folding-in


Training only provides us with a low-rank representation of entities in the training set.
To deal with new entities we must "fold-in" a low-rank representation of that entity. In
maximum likelihood techniques, we have a fixed set of parameters and hyperparameters,
$\{\mathcal{F}, \Theta\}$, which makes the process straightforward. Assume we are adding a row to $X$,
$X_{\text{new},\cdot}$, and that $U_{\text{new},\cdot}$ has been initialized6:

1.  Fix the value of $\{\mathcal{F}, \Theta\}$ to the value which was learned during training. The value
of $U_{\text{new},\cdot}$ remains unfixed.

2.  Using $X_{\text{new},\cdot}$ and the fixed $\{\mathcal{F}, \Theta\}$, estimate $U_{\text{new},\cdot}$. This can be done using several
steps of Newton projection. Essentially, we are training a GLM where $U_{\text{new},\cdot}$ are the
weights.

The process is similar when Bayesian inference is used, except that instead of maximizing
over $U_{\text{new},\cdot}$, we sample several times from the posterior distribution

$$p(U_{\text{new},\cdot} \mid \mathcal{F}, \Theta, X_{\text{new},\cdot}),$$

and average the predictions from each sample.
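
The maximum likelihood fold-in steps can be sketched as a small regularized GLM fit. The example below assumes a logistic link with a binary data row, a Gaussian prior with fixed mean `mu` and precision `prec`, and a simple step-halving rule with smallest step length $2^{-5}$; the function names and toy data are hypothetical and stand in for the Newton projection of Chapter 4.

```python
import numpy as np


def log_posterior(u, x, V, mu, prec):
    """Bernoulli log-likelihood of row x under a logistic link, plus Gaussian log-prior on u."""
    theta = V @ u
    diff = u - mu
    return np.sum(x * theta - np.logaddexp(0.0, theta)) - 0.5 * diff @ prec @ diff


def gradient(u, x, V, mu, prec):
    """Gradient of log_posterior with respect to u."""
    p = 1.0 / (1.0 + np.exp(-(V @ u)))
    return V.T @ (x - p) - prec @ (u - mu)


def fold_in_row(x_new, V, mu, prec, rng, n_steps=25):
    """Estimate U_new by damped Newton steps, holding V and (mu, prec) fixed."""
    # Initialise with a draw from the prior over factor rows.
    u = rng.multivariate_normal(mu, np.linalg.inv(prec))
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(V @ u)))
        neg_hess = V.T @ (V * (p * (1.0 - p))[:, None]) + prec
        step = np.linalg.solve(neg_hess, gradient(u, x_new, V, mu, prec))
        eta = 1.0
        current = log_posterior(u, x_new, V, mu, prec)
        # Halve the step length while the objective would decrease.
        while eta > 2.0 ** -5 and log_posterior(u + eta * step, x_new, V, mu, prec) < current:
            eta /= 2.0
        u = u + eta * step
    return u


rng = np.random.default_rng(0)
k, n_cols = 3, 30
V = rng.normal(size=(n_cols, k))                         # trained column factor (toy stand-in)
x_new = (rng.uniform(size=n_cols) < 0.5).astype(float)   # the new data row (toy data)
mu, prec = np.zeros(k), np.eye(k)
u_new = fold_in_row(x_new, V, mu, prec, rng)
```

In the Bayesian variant, the optimization loop would be replaced by several Metropolis-Hastings draws of the new row, with predictions averaged over the draws.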

5.3.4  Bayesian  Prediction 

Once we have the low-rank representation for an entity (either from training for hold-out,
or by folding-in for new entities) we want to predict the value of relations that entity
participates in. We consider predicting entry $X_{ij}$. For a point estimate the prediction is
simply $\hat{X}_{ij} = f_1(U_{i\cdot} V_{j\cdot}^T)$. Given a posterior distribution over many models, we must average
over the models:

$$p(X_{ij} \mid \mathcal{D}) = \int p(X_{ij} \mid \mathcal{F}, \Theta)\, p(\mathcal{F}, \Theta \mid X, Y, W, \tilde{W})\, d\{\mathcal{F}, \Theta\}. \tag{5.6}$$

Equation 5.6 is known as the posterior predictive distribution. Since we only possess samples
from the posterior distribution, we use a Monte Carlo approximation to the posterior
predictive:

$$p(X_{ij} \mid \mathcal{D}) \approx \frac{1}{S} \sum_{s=1}^{S} p(X_{ij} \mid \mathcal{F}^{(s)}, \Theta^{(s)}), \tag{5.7}$$

6We  initialize  factor  rows  by  drawing  from  the  prior  distribution  over  the  factor’s  rows. 




where $\{\mathcal{F}^{(s)}, \Theta^{(s)}\}$ is a sample from the posterior distribution, generated using Metropolis-Hastings.
We control the number of samples used, $S$. In our experiments, $S = 10$ on
hold-out experiments; $S = 5$ on fold-in experiments. The predictive performance was not
significantly improved by using more samples.
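
The Monte Carlo average can be sketched for a single binary entry, where averaging the per-sample Bernoulli probabilities gives the estimated probability that the entry equals one. The "posterior samples" below are random stand-ins rather than the output of a real sampler, and the function names are hypothetical.

```python
import numpy as np


def logistic(theta):
    return 1.0 / (1.0 + np.exp(-theta))


def posterior_predictive(samples, i, j, link=logistic):
    """Average the per-sample prediction link(U_i . V_j) over S posterior samples."""
    per_sample = np.array([link(U[i] @ V[j]) for U, V in samples])
    return per_sample.mean()


# Stand-in posterior samples: S pairs (U, V) of random factors (NOT real sampler output).
rng = np.random.default_rng(0)
S, m, n, k = 10, 4, 5, 3
samples = [(rng.normal(size=(m, k)), rng.normal(size=(n, k))) for _ in range(S)]
p_hat = posterior_predictive(samples, i=1, j=2)
```

A point estimate corresponds to the special case of a single sample placed at the posterior mode.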

5.3.5  Hypothesis-Specific  Bayesian  Model  Averaging 

Given  the  substantial  effort  involved  in  designing  hierarchical  Bayesian  Collective  Matrix 
Factorization  and  the  adaptive  Metropolis-Hastings  algorithm,  the  reader  may  question 
whether  the  prima  facie  case  for  Bayesianism  is  sufficiently  compelling.  The  case  for 
Bayesianism  rests  on  two  points: 

>  That the posterior distribution over the unknowns $\{\mathcal{F}, \Theta\}$ is highly uncertain, assigning
significant  mass  to  many  different  models.7 

>  That  generalizing  to  new  entities  using  the  posterior  distribution  is  not  subject  to  Welling’s 
criticism  of  matrix  factorization  models  that  generalize  to  new  entities  using  only  the 
posterior  mode  (Section  3.9). 

The  posterior  predictive  distribution,  discussed  in  the  previous  section,  is  the  standard 
Bayesian  solution  for  dealing  with  posterior  uncertainty.  Using  the  posterior  predictive 
distribution  is  precisely  Hypothesis-Specific  Bayesian  Model  Averaging  (Section  2.5.2) 
where the composite hypothesis, $\mathcal{H}$, is the set of distributions defined by the plate model
in Figure 5.1(b). The unknowns $\{\mathcal{F}, \Theta\}$ index the distributions within $\mathcal{H}$.

Hypothesis-Specific Bayesian Model Averaging (HS-BMA) is a way of improving the
quality of predictions given $\mathcal{H}$.8 HS-BMA can be viewed as a form of soft model selection,
picking the best element of $\mathcal{H}$ subject to uncertainty in the unknowns. In most applications
of HS-BMA, $\mathcal{H}$ is fixed with respect to the data, and consistency often holds. In Bayesian
matrix factorization models, such as the ones discussed in this chapter, the number of
parameters required to describe an element of $\mathcal{H}$ depends on the size of the training data.

All low-rank matrix factorizations can suffer from substantial uncertainty in their unknowns;
it is the price we pay to model entity-specific interactions. At first blush, a matrix
looks very much like a table of attribute-value data, which consists of a fixed set of
attributes (columns) and an exchangeable collection of data points (rows). The critical difference
between an attribute-value model and its matrix analogue (e.g., generalized linear
models and Exponential Family PCA) is that the former assigns parameters to attributes;
the latter assigns parameters to entity-attribute interactions. A consequence of this difference
is that while attribute-value models can often provide consistency guarantees, matrix
factorizations do not. The advantage of matrix factorization models is that they can represent
dyadic entity-attribute interactions; attribute-value models cannot.

7The posterior for a matrix factorization can assign mass to many different models that differ only in a
permutation of the columns of the factors in $\mathcal{F}$. Our concern with model uncertainty is the existence of multiple
high-scoring models that are not identical up to a permutation of the factors.

8If one wanted to be pedantic, $\mathcal{H}$ could be added as a conditioning event to all of the distributions
discussed in this chapter. Likewise, in Chapter 4, a composite hypothesis for Collective Matrix Factorization
could be added as a conditioning event to each distribution involving that model.

The intuitive meaning of a consistency guarantee in a Bayesian procedure is that the
posterior distribution over unknowns converges to a point mass on the true generating
model, given sufficient data.9 While consistency holds for many probabilistic
attribute-value models, it does not hold for matrix factorization models: $\mathcal{H}$ is dependent on the
training data. The observations in a matrix factorization are entries in the data matrix, and
so the only way the number of observations can tend to infinity is if (i) one dimension is
fixed and the other grows; or (ii) both rows and columns grow without bound. In case (i), let
the columns be fixed. Even though the columns are fixed, they are not attributes: matrix
factorizations assume that each new row has a unique dyadic interaction with each of the
(fixed) columns. In case (ii), it is even clearer that the parameter space is growing with
the number of observations. Under these circumstances, one should expect substantial
uncertainty, at least over $\mathcal{F}$. We do not know whether consistency holds for the posterior
over $\Theta$; the analysis is complicated by the fact that the parameters model row-column
exchangeable observations (matrices) instead of exchangeable observations (data points).

The case for Hypothesis-Specific Bayesian Model Averaging is particularly strong in
low-rank matrix factorization. While we could do model selection using the posterior
mode, there will always be significant posterior uncertainty over the low-rank factors,
regardless of how large the data matrices get. The problem of posterior uncertainty can be
even worse if the data matrices are sparsely observed. We want to do model selection
over $\{\mathcal{F}, \Theta\}$, but we also want to be mindful of posterior uncertainty in the unknowns,
especially since we cannot eliminate such uncertainty by throwing more data at the
problem. Using the posterior predictive distribution allows for a compromise, i.e., soft model
selection. In our experiments (Section 5.4) we find that soft model selection yields better
predictions than model selection over our choice of $\mathcal{H}$: the Bayesian solution outperforms
the maximum a posteriori solution.


9We assume that $\mathcal{H}$ contains the true generating model, and that the true generating model is assigned
non-zero probability in the prior.
Variants of Hypothesis-Specific Bayesian Model Averaging


We choose to integrate out the uncertainty in both low-rank factors $\mathcal{F}$ and hyperparameters
$\Theta$ in the posterior predictive distribution (Equation 5.6). One may choose to treat the
unknowns differently: e.g., integrate out the uncertainty over $\mathcal{F}$, but use only the posterior
mode for $\Theta$. In our setting, the update for $\Theta$ is an analytic Gibbs step. There are variants
of the model we consider where the update for $\Theta$ is not analytic, and using the posterior
mode may have computational advantages: e.g., if the low-rank factors were subject to a
stochastic or non-negative constraint, with the same Gaussian prior over factor rows.


5.3.6  Exploiting  Decomposition 

There is a strong similarity between alternating Newton-projection for maximum
likelihood inference in hierarchical Collective Matrix Factorization, and the adaptive block
Metropolis-Hastings algorithm we propose for hierarchical Bayesian Collective Matrix
Factorization. The underlying graphical model for both approaches is the same
(Figure 5.1(b)).

One way of viewing the decomposition is that learning a bilinear form, $UV^T$ and
$VZ^T$, is reduced into cyclically learning linear forms: e.g., $U_{i\cdot}$ when $V$ is fixed. Learning
a linear model is a well-studied problem in machine learning, and we take advantage of
that in this chapter. Once the decomposition is established, we create a Bayesian algorithm
for collective matrix factorization by simply replacing an $\ell_2$-regularized Generalized Linear
Model (GLM) [78] with its Bayesian analogue. Hessian Metropolis-Hastings was first
proposed for solving Bayesian GLMs; we extend its use to Bayesian collective matrix
factorization.
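To make the reduction from a bilinear form to cyclic linear models concrete, the following is a minimal sketch for the simplest special case only: a single, fully observed matrix with Gaussian entries and a fixed spherical Gaussian prior, so that each row update is a ridge regression with a closed form. The function name `als_gaussian` and the parameters `lam` and `iters` are illustrative assumptions, not part of the algorithms in this chapter, which handle general exponential families through Newton steps.

```python
import numpy as np

def als_gaussian(X, k, lam=0.1, iters=20, seed=0):
    """Factor X ~= U @ V.T by cycling over ridge regressions (Gaussian case only).

    Assumptions for illustration: X is fully observed, the prior on every
    factor row is N(0, (1/lam) I), and the noise variance is 1.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(n, k))
    reg = lam * np.eye(k)
    for _ in range(iters):
        # With V fixed, each row of U is an independent ridge regression
        # of the row X[i, :] on the columns of V.
        U = np.linalg.solve(V.T @ V + reg, V.T @ X.T).T
        # With U fixed, each row of V is a ridge regression of the
        # column X[:, j] on the columns of U.
        V = np.linalg.solve(U.T @ U + reg, U.T @ X).T
    return U, V
```

Replacing the closed-form ridge solve with a Newton step gives the non-Gaussian MAP update, and replacing it with a sampler over the same linear model gives the Bayesian variant described above.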

We  note  that  there  are  other  ways  that  we  may  productively  use  ideas  from  linear 
models.  In  MAP  inference  we  have  found  that  sample-based  estimates  of  the  gradient  and 
Hessian  increase  the  initial  rate  of  convergence  during  training  [116].  We  speculate  that 
using  stochastic  gradients  and  Hessians  in  HMH  may  be  useful  in  accelerating  training,  in 
much the same manner that mini-batches are used to accelerate other MCMC techniques.
Moreover,  the  lack  of  a  guarantee  on  decrease  in  the  objective  at  each  iteration  of  stochastic 
projection  does  not  affect  the  convergence  guarantees  of  Metropolis-Hastings:  a  bad  step 
can  affect  the  acceptance  rate,  but  not  convergence  guarantees. 
Algorithm 2: Decomposition Algorithm for MAP and Bayesian Inference

Input: Data matrices, one per relation: $\{X^{(e,e')}\}$. The model (embedding dimension, the choice of
exponential family for the entries of each matrix, and the hyperparameters if they are fixed in
advance).

Output: Low-rank factors for each entity-type: $U^{(1)}, \ldots, U^{(E)}$.

for $e = 1 \ldots E$ do
    Initialize $U^{(e,0)}$ using Algorithm 3 with prior mean $\mu_e = 0$ and $\Sigma_e = I$. This initialization
    works well for either MAP or Bayesian inference.
end

while not converged do
    for $e = 1 \ldots E$ do
        foreach row of $U^{(e,t)}$: $U_{i\cdot}^{(e,t)}$ do
            Update using $U_{i\cdot}^{(e,t)}$ and all the observations involving the current entity: i.e.,
            $\forall e'$, $X_{i\cdot}^{(e,e')}$. For MAP inference the update is a convex optimization, implemented as
            a Newton step; for Bayesian inference we sample from the conditional sampling
            distribution (Equation 5.4) using Hessian Metropolis-Hastings (Algorithm 4).
        end
        If the hyperparameters for factor $U^{(e)}$, i.e., $(\mu_e, \Sigma_e)$, are not fixed then compute the
        normal-Inverse-Wishart posterior with parameters defined in Equation 5.1. Use the
        posterior mode for MAP, or a Gibbs sample for Bayesian inference.
    end
    $t = t + 1$
end


5.4  Experiments 

Mapping fMRI data into related matrices: In this section, we show that our Bayesian
approach is significantly better than the MAP alternative, using the fMRI+word
problem outlined in the introduction. The data collection protocol is described in detail in
Mitchell et al. [83]. The stimuli, the shared entities, consist of word-picture pairs flashed
on a screen (e.g., bear, barn, pliers). The stimuli are chosen to be representative of
categories (e.g., animals, buildings, tools). Nine subjects are presented with each of sixty
stimuli. Each subject was presented with each stimulus six times. For each stimulus word
we generated the average response of each voxel over the 9 × 6 trials.10 This
procedure results in the Response(stimulus, voxel) relation. There are initially >20,000
voxels, which we restrict to a set of 500 voxels deemed to be most stable [83]. The
Co-occurs(words, stimulus) relation is collected by measuring whether or not the

10We  acknowledge,  but  do  not  correct  for,  registration  discrepancies  in  fMRI  imaging.  Each  brain  has 
a  different  shape,  which  must  be  mapped  onto  a  canonical  coordinate  system  (canonical  brain  map).  The 
fMRI  images  we  average  over  have  been  registered  (for  details,  refer  to  Mitchell  et  al.  [83]). 




Algorithm 3: Initialization for Hessian Metropolis-Hastings

Input: Number of entities of type $e$, $n_e$. Prior mean $\mu_e$ and covariance $\Sigma_e$.

Output: Initial value for low-rank factor $U^{(e,0)}$. Mean and negative precision matrix for the first
forward sampling distribution: $\hat{U}_{i\cdot}^{(e,0)}$ and $\nabla^2 \mathcal{O}$.

for $i = 1 \ldots n_e$ do
    Sample from the prior: $U_{i\cdot}^{(e,0)} \sim \mathcal{N}(\mu_e, \Sigma_e)$.
    Choose a random step length: $\eta \sim U[0,1]$.
    Compute gradient and Hessian. Estimate posterior mean using one Newton step:
        $\hat{U}_{i\cdot}^{(e,0)} = U_{i\cdot}^{(e,0)} - \eta \, \nabla\mathcal{O}(U_{i\cdot}^{(e,0)}) \left[ \nabla^2 \mathcal{O}(U_{i\cdot}^{(e,0)}) \right]^{-1}$
end


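Algorithm 4: Hessian Metropolis-Hastings

Input: Previous sample from the Markov chain, $U_{i\cdot}^{(e,t)}$. Observations involving entity $i$.
Output: Next sample: $U_{i\cdot}^{(e,t+1)}$. Mean and negative precision of the next proposal.

Sample the proposal: $U_{i\cdot}^{(e,*)} \sim \mathcal{N}\left( \hat{U}_{i\cdot}^{(e,t)}, \left[ -\nabla^2 \mathcal{O}(U_{i\cdot}^{(e,t)}) \right]^{-1} \right)$.
Compute the gradient $[\nabla\mathcal{O}(U_{i\cdot})]$ and Hessian $[\nabla^2 \mathcal{O}(U_{i\cdot})]$ at $U_{i\cdot}^{(e,*)}$.
Estimate the posterior mean using one Newton step with random step length:
    $\hat{U}_{i\cdot}^{(e,*)} = U_{i\cdot}^{(e,*)} - \eta \, \nabla\mathcal{O}(U_{i\cdot}^{(e,*)}) \left[ \nabla^2 \mathcal{O}(U_{i\cdot}^{(e,*)}) \right]^{-1}$, where $\eta \sim U[0,1]$.
Compute the acceptance probability
    $\alpha = \min\left\{ 1, \frac{p(U_{i\cdot}^{(e,*)} \mid \cdot)}{p(U_{i\cdot}^{(e,t)} \mid \cdot)} \times \frac{\mathcal{N}\left( U_{i\cdot}^{(e,t)} \mid \hat{U}_{i\cdot}^{(e,*)}, [-\nabla^2 \mathcal{O}(U_{i\cdot}^{(e,*)})]^{-1} \right)}{\mathcal{N}\left( U_{i\cdot}^{(e,*)} \mid \hat{U}_{i\cdot}^{(e,t)}, [-\nabla^2 \mathcal{O}(U_{i\cdot}^{(e,t)})]^{-1} \right)} \right\}$,
    where $p(\cdot \mid \cdot)$ is the conditional sampling distribution (Equation 5.4).
if $r \sim U[0,1] < \alpha$ then $k = *$ else $k = t$.
Update: $U_{i\cdot}^{(e,t+1)} = U_{i\cdot}^{(e,k)}$, $\hat{U}_{i\cdot}^{(e,t+1)} = \hat{U}_{i\cdot}^{(e,k)}$, $\nabla^2 \mathcal{O}(U_{i\cdot}^{(e,t+1)}) = \nabla^2 \mathcal{O}(U_{i\cdot}^{(e,k)})$.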
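The steps of Algorithm 4 can be sketched for a single factor row as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in the experiments: the data for the row are Bernoulli with a logistic link, the other factor `V` is held fixed, the prior on the row is Gaussian with mean `mu` and precision `prec`, and the objective is taken to be the log of the unnormalized conditional posterior, so its Hessian is negative definite. All function names are hypothetical.

```python
import numpy as np

def log_post(u, V, x, mu, prec):
    """Log of the unnormalized conditional posterior of one row (Bernoulli data)."""
    z = V @ u
    d = u - mu
    return float(x @ z - np.logaddexp(0.0, z).sum() - 0.5 * d @ prec @ d)

def newton_proposal(u, V, x, mu, prec, rng):
    """Mean and precision of the Gaussian proposal constructed at u."""
    p = 1.0 / (1.0 + np.exp(-(V @ u)))
    grad = V.T @ (x - p) - prec @ (u - mu)
    hess = -(V.T * (p * (1.0 - p))) @ V - prec     # negative definite
    eta = rng.uniform()                            # random step length in [0, 1)
    mean = u - eta * np.linalg.solve(hess, grad)   # one damped Newton step
    return mean, -hess                             # -hess acts as the precision

def log_gauss(y, mean, precision):
    """Log density of N(mean, precision^{-1}) at y, dropping the 2*pi constant."""
    d = y - mean
    _, logdet = np.linalg.slogdet(precision)
    return 0.5 * logdet - 0.5 * d @ precision @ d

def hmh_step(u, prop, V, x, mu, prec, rng):
    """One accept/reject step; prop is the (mean, precision) built at u."""
    mean, P = prop
    L = np.linalg.cholesky(P)                      # P = L @ L.T
    u_star = mean + np.linalg.solve(L.T, rng.standard_normal(len(u)))
    prop_star = newton_proposal(u_star, V, x, mu, prec, rng)
    log_alpha = (log_post(u_star, V, x, mu, prec) - log_post(u, V, x, mu, prec)
                 + log_gauss(u, *prop_star)        # reverse move: u given u_star
                 - log_gauss(u_star, mean, P))     # forward move: u_star given u
    if np.log(rng.uniform()) < log_alpha:
        return u_star, prop_star, True
    return u, prop, False
```

A chain is started by drawing `u` from the prior and calling `newton_proposal` once, mirroring Algorithm 3; each call to `hmh_step` then returns the next sample together with the proposal parameters to reuse at the next iteration.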


stimulus word occurs within five tokens of a word found in the Google Teraword
corpus [14]. Of these words, we select 20,000 at random. The relations map into two
matrices: $X$ = Co-occurs and $Y$ = Response. Throughout our experiments we assume that
$X_{ij}$ is Bernoulli distributed, and that $Y_{ij}$ is Gaussian. The embedding dimension is 25
throughout.

Evaluation Criteria: The first type of experiment is hold-out prediction: the entities
are fixed, and we hold out entries of $X$ and $Y$. After learning the model $\mathcal{F}$ or $\{\mathcal{F}, \Theta\}$,
we use it to predict the held-out entries of $X$ and $Y$. In the Bayesian case, we average
the response from several sampled models. The second type of experiment is fold-in
prediction: instead of holding out entries, we hold out entire words and voxels. After
learning a model using the remaining rows of $X$ and columns of $Y$, we “fold-in” the
held-out entities. Since we held out entities, we need to find their low-rank representation under
each model. The observations involving the held-out entity are split: two-thirds are used
to fold the entity into the model; the rest are used to estimate test error. In both hold-out
and fold-in experiments, a measure of test error is required: we use mean squared error, a
natural choice given that the voxel responses are modelled as Gaussians.
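The fold-in protocol can be illustrated with a short sketch for the Gaussian, point-estimate case. It assumes a trained factor `V` is held fixed, the held-out entity has one observation per row of `V`, and the new low-rank representation is found by ridge regression; the function name `fold_in_mse` and the regularization weight `lam` are assumptions made for illustration.

```python
import numpy as np

def fold_in_mse(y, V, lam=1.0, seed=0):
    """Fold in one held-out entity and return its test mean squared error.

    y: observations for the held-out entity, one per row of V.
    V: fixed factor learned from the remaining entities (n x k).
    """
    rng = np.random.default_rng(seed)
    n, k = V.shape
    perm = rng.permutation(n)
    cut = (2 * n) // 3                     # two-thirds to fold in, the rest to test
    fit_idx, test_idx = perm[:cut], perm[cut:]
    Vf = V[fit_idx]
    # Low-rank representation of the new entity (MAP under a Gaussian prior).
    u = np.linalg.solve(Vf.T @ Vf + lam * np.eye(k), Vf.T @ y[fit_idx])
    resid = y[test_idx] - V[test_idx] @ u
    return float(np.mean(resid ** 2))
```

In the Bayesian case the single ridge solution would be replaced by samples of the new factor row, with predictions averaged over those samples.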

Models  Compared:  We  compare  three  variants  of  Collective  Matrix  Factorization: 

>  Collective Matrix Factorization (CMF), where the model is the most likely value of $\mathcal{F}$,
given fixed $\Theta$. We discuss how $\Theta$ is chosen below.

>  Hierarchical Collective Matrix Factorization (H-CMF), where the model is the
maximizer of Equation 5.3 over $\{\mathcal{F}, \Theta\}$.

>  Hierarchical Bayesian Collective Matrix Factorization (HB-CMF), where the model is
the posterior distribution over $\{\mathcal{F}, \Theta\}$: Equation 5.2. Since the posterior lacks a closed
form, we cannot directly compute the posterior predictive distribution. Instead,
predictions are made using samples of $\mathcal{F}$ and a Monte Carlo estimate of the posterior predictive
distribution.
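As a rough illustration of the Monte Carlo estimate used by HB-CMF, the sketch below averages predictions over sampled factor pairs. The names `posterior_predictive_mean` and `sigmoid`, and the representation of a sample as a `(U, V)` tuple, are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic link, mapping natural parameters to Bernoulli means."""
    return 1.0 / (1.0 + np.exp(-z))

def posterior_predictive_mean(samples, link=lambda z: z):
    """Average link(U @ V.T) over sampled factor pairs (U, V).

    The identity link gives the predicted mean for Gaussian entries;
    passing sigmoid gives predicted probabilities for Bernoulli entries.
    """
    preds = [link(U @ V.T) for U, V in samples]
    return np.mean(preds, axis=0)
```

With a nonlinear link, averaging predictions over samples generally differs from predicting with averaged factors, which is one way the posterior predictive departs from a plug-in point estimate.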

The difference between CMF and H-CMF is the addition of an automatically learned
hierarchical prior. The difference between H-CMF and HB-CMF is the difference
between maximum likelihood and Bayesian inference on the same underlying model
(Figure 5.1(b)). In each of the three approaches, folding-in consists of using the mode or samples
from the posterior over new factors to make predictions.

For CMF, we must choose a diagonal covariance matrix for each factor $F$, $\Sigma_F$. We
propose a psychic initialization of the covariance: let the $j$th diagonal element of $\Sigma_F$ be
the variance of the $j$th column of the estimate of $F$ produced by H-CMF.11 The variants of
Collective Matrix Factorization are generalizations of ones we will discuss in Chapter 6
(see Table 6.1): on the Voxel+Word data CMF = MRMF = SoRec, where the regularizer is
carefully tuned; on the Voxel-only data HB-CMF ≈ BXPCA: the likelihoods are the same,
but we do not assume a conjugate prior over $V$, and the Gaussian priors have diagonal
covariance; and CMF = PMF, except that we use full covariance matrices for the Gaussian
prior.
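The psychic initialization amounts to a one-line computation, sketched here. The function name is hypothetical, and the use of the population variance (NumPy's default `ddof=0`) is an assumption, since the text does not specify the estimator.

```python
import numpy as np

def diagonal_prior_from_factor(F_hat):
    """Diagonal covariance whose jth entry is the variance of column j of F_hat.

    F_hat is assumed to be the factor estimate produced by H-CMF, with one
    row per entity and one column per embedding dimension.
    """
    return np.diag(np.var(F_hat, axis=0))
```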

The  results  for  the  hold-out  experiment  (Figure  5.2(a))  indicate  the  value  of  the  Bayesian 
approach.  The  only  difference  between  HB-CMF  and  H-CMF  is  that  the  former  is  learned 

11CMF requires a good fixed prior for a fair comparison. A computationally expensive approach for
finding a fixed prior would be cross-validation. Our approach is computationally simpler: use an empirical
estimate of the prior parameters.
[Figure 5.2: bar charts with panels (a) Hold-out and (b) Fold-in; bars for HB-CMF, H-CMF, and CMF.]

Figure 5.2: Performance on predicting Response(stimulus, voxel) using just the
voxel response (Voxel), and augmenting it with word co-occurrence data (Words + Voxels).
The bars represent algorithms discussed: CMF (Chapter 4), H-CMF (Section 5.2),
HB-CMF (Section 5.3). The error bars are 2-standard deviations.

[Figure 5.3: plots with panels (a) Hold-out and (b) Fold-in; energy vs. epochs (top) and test error vs. time (bottom).]

Figure 5.3: Mixing behavior of Algorithm 4. The slowest mixing instance of the
hold-out and fold-in experiments are shown. Each point on the energy vs. epochs plots (top)
measures the loss of a sample. Each point on the test error vs. time plots (bottom) measures
the test error of a sample on predicting word co-occurrence or voxel activation.
using Bayesian inference: handling posterior uncertainty leads to vastly better predictions.
In  addition,  a  statistically  significant  improvement  is  achieved  by  augmenting  with  the 
word  data.  The  results  for  the  fold-in  experiment  (Figure  5.2(b))  show  an  even  larger 
difference  between  HB-CMF  and  its  competitors.  Perhaps  surprisingly,  we  are  able  to 
achieve  high  prediction  accuracy  when  testing  on  voxels  in  the  brain  that  never  appeared 
in the training set. Our Bayesian fold-in procedure was able to infer the value of
unobserved voxels simply by looking at their correlations with existing voxels. The training
and  test  voxels  are  drawn  from  both  hemispheres,  with  weak  spatial  correlation — voxels 
from  regions  associated  with  visual  processing  tended  to  be  more  stable.  Again,  there  is  a 
statistically  significant  improvement  in  augmenting  the  data  with  word  co-occurrence. 

As a baseline measure, we can consider the mean over all the voxel responses (entries
in $Y$) as a default predictor. Since the entries in the $Y$ matrix have been standardized to
zero-mean, unit-variance Gaussians, the mean predictor is 0, which yields a mean squared error
equal to the variance, i.e., 1. Comparing HB-CMF to H-CMF, where the only difference is
learning the posterior distribution vs. learning the posterior mode, there is a clear
advantage to being Bayesian.
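The baseline figure can be checked numerically. The sketch below uses synthetic data and assumes the standardization is applied globally over all entries of the matrix; whether the original preprocessing was global or per voxel is not stated here, so this is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the 60 x 500 stimulus-by-voxel matrix Y.
Y = rng.normal(loc=3.0, scale=2.0, size=(60, 500))

# Standardize all entries to zero mean and unit variance.
Y_std = (Y - Y.mean()) / Y.std()

# The mean predictor is then 0, and its mean squared error is the variance.
baseline_mse = np.mean((Y_std - 0.0) ** 2)
```

Any model with mean squared error below 1 on standardized responses is therefore improving on the default predictor.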

The  reader  may  rightly  question  whether  the  use  of  non-Gaussian  data  distributions 
is  worth  the  additional  complexity  of  Metropolis-Hastings  over  Gibbs  sampling.  If  we 
take  HB-CMF,  the  best  performing  algorithm,  and  replace  the  non-Gaussian  prediction 
links  with  Gaussian  ones,  prediction  accuracy  is  26%  worse  on  hold-out  prediction,  and 
39%  worse  on  fold-in  prediction.  Thus,  there  is  a  statistically  significant  benefit  to  using 
non-Gaussian  data  distributions  on  the  fMRI+word  data. 

Performance:  The  hierarchical  Bayesian  method  is  vastly  better  at  the  predictive  task 
than  its  non-Bayesian  analogues;  it  also  scales  well,  for  a  Bayesian  method.  Sampling 
the  parameters  and  hyperparameters  in  HB-CMF  takes,  on  average,  7.2s  for  the  hold-out 
experiment  using  a  parallel  implementation  on  four  processors.12  Our  implementation  was 
developed in MATLAB, using the Distributed Computing Environment toolkit and its
implementation of parallel-for. An analogous iteration of H-CMF takes < 10s. That said,
H-CMF converges in less than 20 iterations; HB-CMF takes over 100 iterations to
converge. The practicality of the Bayesian approach depends largely on how many iterations
are  required  (Figure  5.3).  The  energy  vs.  epochs  plots  suggest  that  the  underlying  Markov 
chain  over  states  mixes  quickly. 

The  premise  of  this  dissertation  is  that  the  world  abounds  with  entities  that  are  related, 
but  not  by  any  readily  apparent  set  of  logical  rules.  Our  matrix  factorization  approach 
allows  for  information  integration  in  such  scenarios.  However,  we  are  limited  by  the  use  of 

12The  four  processors  are  cores  on  an  AMD  Opteron  2384  CPU,  operating  at  2.7GHz. 
point estimators. A Bayesian alternative is desirable, but we need an approach that scales
to large matrices, without having to choose between artificial conjugacy assumptions and
painstaking tuning of a proposal distribution. By decomposing the problem into cyclic
updates over linear models, and using what we know about efficient MAP estimation, we
can be Bayesian without compromise: the model need not be simplified; the operator need
never tune a proposal; and the approach scales linearly with the number of entities, with
almost no random walk behaviour in the sampler.
Chapter 6


Literature  Survey 


6.1  A  (very)  brief  overview  of  relational  learning 

Relational  learning  is  too  vast  in  scope,  and  too  varied  in  its  goals,  to  survey  to  more 
than  a  modicum  of  completeness.  Instead,  we  begin  by  briefly  discussing  approaches  to 
relational  data  that  we  believe  stand  apart  from  the  focus  of  this  thesis: 

>  Graph-based techniques: Graphs can be used to encode relational data: nodes
correspond to entities; links correspond to binary relations. Such representations deal with
data where there is one arity-two relation involving one entity-type. Even in the single
matrix case, we allow for relations involving two different entity-types (one indexed
by rows, another indexed by columns). In collective matrix factorization, we allow
for multiple types of relations, involving potentially many different entity-types. From
the perspective of flexibility of relational representation, the closest graph-based
approaches are those that deal with attributed graphs (e.g., Tong et al. [127], Minkov and
Cohen [82], Singh et al. [117]). These works draw much of their inspiration from
PageRank [92], which uses a random walk over a graph to induce rankings of nodes. Even in
attributed graph mining the data is assumed to be static: observed edges are a fixed set,
and nodes do not appear over time. Our matrix-based models allow for relations/edges
to appear during training (i.e., online learning). Moreover, hierarchical Bayesian
collective matrix factorization defines a fully generative model of entities, allowing new
entities/nodes to be easily added to the data set.

>  Propositionalization: Refers to approaches that first convert a relational data set into a
tabular, or attribute-value, data set (e.g., Kramer et al. [59]). The transformed data set is
then given to a standard attribute-value learning algorithm for prediction or clustering:
e.g., logistic regression. Some techniques, like structural logistic regression [97], treat
propositionalization as part of the learning problem. In logic-based representations of
the data, propositions may take the form of rules; the value of the corresponding
attribute is true or false. In graph-based representations of the data, propositions may take
the form of structural properties of the graph: e.g., the number of neighbouring nodes
that satisfy a given property, or the hitting time of a random walk on the graph. This
thesis is primarily concerned with predicting the value of a relation. Propositionalization
removes all relations from the data set. Ablating relations may be acceptable if our goal
is to predict the properties of entities using relations; ablating relations is not acceptable
if our goal is to predict properties of sets of entities, i.e., relations.

A common form of propositionalization in databases is denormalization, where multiple
tables in a database are joined into a single one. If any of the relationships in the original
schema are one-to-many or many-to-many, then records are repeated many times in
the join, which can skew the marginal distribution of an attribute. In the augmented
collaborative filtering example, Rating(user, movie) and HasGenre(movie, genre)
are many-to-many relations. Were the relations joined, the frequency of a particular
genre would depend on how often movies of a particular genre are rated. If many of the
rated movies are action films, then the marginal distribution of genres would be skewed
towards them, regardless of how common action films are among the set of movies.
This problem is known as relational autocorrelation [55].

>  Inductive Logic Programming: First-order logic (FOL) provides a particularly
expressive language for relational models. The earliest approach that uses FOL for relational
learning is inductive logic programming [86, 62], which is a non-probabilistic approach
to relational learning. Probabilistic extensions of ILP exist (e.g., [101]), and are
examples of statistical relational models.

Inductive logic programming (ILP) is the task of inducing first-order rules from
observed relations, encoded as statements in first-order predicate logic, a logic program.
Once a logic program, usually a set of Horn clauses, is inferred one can deduce new facts
about entities using existing rules and facts. There are a variety of techniques for ILP
based both on generalization from clauses in a knowledge base [110, 111, 85, 86, 121],
and on specialization techniques, which build larger logic programs in much the same
fashion as decision trees build larger classification rules [115, 99].

Logic  programs  can  only  generalize  to  statements  that  can  be  deduced.  Our  concern 
is  with  predicting  statements  that  are  probable,  but  not  easily  deducible.  Hence  our 
interest  in  statistical  relational  models  (see  below). 
>  Statistical Relational Models: More recently, a number of researchers have explored
probabilistic extensions of logical representations, using these extensions for what we
have come to know as statistical relational learning (cf., [40]). The most high-level
description of a statistical relational model is based on the possible worlds
interpretation in first-order logic. First-order logic defines a (possibly infinite) set of possible
worlds, each of which is assigned a truth value. A statistical relational model instead
defines a distribution over possible worlds, typically using a graphical model with
repeated structure. Various subsets of logic provide the macro language for defining
repeated structure in these graphical models. For example, Probabilistic Relational
Models [39] can be viewed as a generalization of Bayesian networks to relational databases,
where frame-based logic provides the macro language for defining the Bayesian
network. Markov Logic Networks [105] can be viewed as a generalization of Markov
Networks to relational domains, where a subset of first-order logic provides the macro
language for defining repeated structure within the distribution over possible worlds.
Bayesian Logic [79] defines a declarative language for defining contingent Bayesian
networks, allowing one to reason about unknown objects using an infinite set of
possible worlds.

A recurring theme in these approaches is the desire to represent an enormous variety
of graphical models. For example, Probabilistic Graphical Models subsume
discrete-variable Bayesian networks. Markov Logic Networks can represent, with sufficiently
many variables, any discrete-variable probability distribution.1 While we laud attempts
to create high-level languages for graphical models, the focus of this thesis is on
providing a statistical design pattern that (i) provides a simple, restricted interface to the
modeler, i.e., sets of related matrices; (ii) can help provide guidance to modelers in
defining repeated structure, especially in domains where defining repeated structure
using expert knowledge is difficult (e.g., consider the brain imaging example in Chapter 5,
where one would be hard pressed to encode relationships between words and regions of
the brain); (iii) allows for scalable training and inference algorithms whose
implementation need not be exposed to the modeler.
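Returning to the denormalization example under propositionalization above, the skew introduced by a join can be reproduced on a toy data set. The movies, genres, and ratings below are invented for illustration, and each movie is given a single genre to keep the sketch short, although HasGenre is many-to-many in general.

```python
from collections import Counter

# Toy HasGenre relation, simplified to one genre per movie.
has_genre = {"m1": "action", "m2": "drama", "m3": "comedy", "m4": "drama"}

# Toy Rating relation as (user, movie) pairs; m1 is rated most often.
ratings = [("u1", "m1"), ("u2", "m1"), ("u3", "m1"), ("u4", "m1"),
           ("u1", "m2"), ("u2", "m3")]

def genre_share(genres):
    """Relative frequency of each genre in an iterable of genre labels."""
    counts = Counter(genres)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

# Marginal distribution of genre over the set of movies.
catalog_share = genre_share(has_genre.values())

# Marginal distribution of genre after joining Rating with HasGenre:
# each movie's genre is repeated once per rating.
joined_share = genre_share(has_genre[movie] for _, movie in ratings)
```

Here action films are a quarter of the catalog but two-thirds of the joined table, simply because one action film was rated four times.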


The  closest  areas  of  related  work  are  in  the  literature  on  matrix  factorization.  As  such,  we 
review  the  connections  between  collective  matrix  factorization  and  other  low-rank  matrix 
factorizations  below. 

1Markov Logic Networks can also represent distributions over real-valued variables, provided that there
is a fixed-bit representation of reals.
6.2  Connections to single-matrix factorization


There  are  two  ways  in  which  this  thesis  relates  to  the  (vast)  literature  on  low-rank  matrix 
factorization.  The  first  is  descriptive,  centering  on  our  unified  view  of  matrix  factorization 
in  Chapter  3.  The  second  is  algorithmic,  centering  on  our  work  on  efficient  algorithms  for 
maximum likelihood and Bayesian inference in matrix factorization models.

Our unified view of matrix factorizations places a wide variety of low-rank matrix
factorizations into the same conceptual framework, reducing each matrix factorization to a
small number of modeling choices. We subsume factor analysis, clustering, and
matrix co-clustering techniques into the same formalism. We believe that this is a significant
contribution to understanding the literature on single-matrix factorization. Moreover, if
the end-user of matrix factorization models is to have any hope of choosing from the
variety of models, they must first understand what makes the models different. Ultimately,
the goal of a statistical design pattern is to formalize the expertise that goes into the design
and implementation of a class of graphical models. Our hope is that this formalization
of low-rank matrix factorization models will facilitate their use, and facilitate transfer of
expertise across their domains of use.

Our algorithmic contributions center on the idea that matrix factorizations are
bilinear models, or sets of tied bilinear models in collective matrix factorization.
Bilinear models are closely related to linear models, and we can exploit decomposability of
the loss to reduce inference into cyclic updates over tied linear models. Reducing
inference into cyclic updates allows us to exploit componentwise convexity. In contrast,
other techniques for solving low-rank matrix factorization usually treat training as a
nonlinear optimization with little special structure: e.g., gradient descent, conjugate
gradient, expectation-maximization, multiplicative updates.2 In the case of max-margin matrix
factorization, the trace-norm constraint makes the loss non-decomposable. The
exceptions among solvers that exploit structure in low-rank matrix factorization are specialized
techniques for SVD (e.g., Gaussian Elimination), and the alternating Bregman projection
approach for E-PCA [26, 45].

The idea of reducing the parameter space to cyclic updates over subsets of the
parameters has been used in Bayesian matrix factorizations, but there has always been
a trade-off between richness of the models, and the computational cost of training:

>  Block Gibbs samplers require that the conditional distribution of each block can be
sampled from exactly. Most frequently, we require that the likelihood and prior are
conjugate, as is the case with Salakhutdinov and Mnih [109] (Gaussian likelihood and
prior) and Buntine and Jakulin [19] (Poisson likelihood, Gamma prior; Poisson
likelihood, conditional Gamma prior; Multinomial likelihood, Dirichlet prior). In certain
special cases, there are non-conjugate pairs of distributions whose product can be
sampled from exactly: e.g., the product of a Gaussian and an Exponential distribution is
a rectified Normal distribution, which allows for a Bayesian extension of non-negative
matrix factorization [114].

2Multiplicative updates are largely focused on non-negative matrix factorization, and its variants. In the
case of NMF, it can be shown that the multiplicative update can be derived from gradient descent with a
particular, parameter dependent, scaling of the gradient [63].

The  requirement  of  exact  conditional  distributions  on  each  block  becomes  more  severe 
when  dealing  with  multiple  matrices,  each  with  their  own  data  distribution.  To  update 
the  shared  factor  with  Gibbs,  the  data  likelihoods,  as  well  as  the  priors,  must  be  mutually 
conjugate. 

>  Hamiltonian  Monte  Carlo:  Hamiltonian  Monte  Carlo  (a.k.a.  Hybrid  Monte  Carlo)  [88, 
74]  is  a  technique  that  utilizes  the  gradient  of  the  distribution  being  modelled  to  improve 
the  mixing  rate.  HMC  is  used  in  a  Bayesian  generalization  of  E-PCA  [84].  In  this 
Bayesian  version  of  E-PCA,  the  prior  over  rows  of  U  is,  like  our  work,  multivariate 
Gaussian;  and  the  prior  over  rows  of  V  is  chosen  to  be  conjugate  to  the  likelihood. 
A  hierarchical  prior  is  placed  over  rows  of  U ,  but  not  V.  The  hyperprior  over  the 
multivariate  Gaussian  is  a  restricted  form  of  the  Inverse-Wishart:  the  covariance  matrix 
must  be  diagonal. 

The  use  of  Hamiltonian  Monte  Carlo  requires  that  the  variables  must  be  unconstrained, 
which  requires  careful  use  of  change-of-variable  transforms  to  handle  covariance  matri¬ 
ces.  An  advantage  of  our  block  adaptive  MCMC  approach  is  that  we  can  exploit  partial 
Hessian  structure,  namely,  the  per-row  Hessians.  Unlike  Mohamed  et  al.  [84],  we  use 
partial  Hessian  information  in  addition  to  the  gradient. 

>  Variational  Bayes  is  a  form  of  approximate  inference,  whereas  the  adaptive  Metropolis- 
Hastings  sampler  we  propose  is  asymptotically  exact.  Variational  inference  can  make 
claims  about  runtime  independent  of  the  data  presented;  but  the  quality  of  the  approx¬ 
imation  depends  on  the  data  matrix  X.  In  contrast,  our  adaptive  Metropolis-Hastings 
approach  cannot  make  claims  about  runtime  independent  of  the  data;  but  we  can  make 
guarantees  with  respect  to  the  quality  of  the  approximation  which  can  be  achieved: 
asymptotically,  our  approach  is  exact.  When  a  variational  method  performs  poorly,  it  is 
hard  to  disambiguate  whether  this  is  due  to  the  model  or  the  quality  of  the  approxima¬ 
tion.  With  MCMC,  we  can  test  whether  the  chain  has  mixed  (although  this  is  non-trivial 
to  do  in  practice).  Once  we  are  confident  that  the  chain  has  mixed,  and  that  we  have 
enough  samples  from  the  posterior,  poor  results  can  be  readily  attributed  to  the  model. 

The broader objection to variational methods is that every time the likelihood and prior
change, a new lower bound must be derived. More choices of model mean more work
for the modeler, both in deriving the bound and in writing code to optimize it.

The lower bound used in variational inference can be designed to take advantage of
decomposability of the loss, reducing training to updates over rows of factor matrices.
For example, VBSVD [67] is a Bayesian analogue to weighted SVD. Their variational
Bayesian  solution  assumes  that  the  posterior  distribution  over  each  row  of  U  and  V  is 
Gaussian:  the  update  equations  are  over  each  row  of  each  factor.  The  quality  of  the 
approximations  depends,  in  part,  on  how  close  the  per-row  posteriors  are  to  Gaussian. 
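
The rectified Normal case mentioned under block Gibbs sampling above can be checked by completing the square. The derivation below is a sketch in notation that is assumed here rather than taken from the thesis: a Gaussian factor in $x$ with mean $\mu$ and variance $\sigma^2$, and an Exponential factor with rate $\lambda$ supported on $x \ge 0$.

```latex
\begin{align*}
\mathcal{N}(x \mid \mu, \sigma^2)\,\mathrm{Exp}(x \mid \lambda)
 &\propto \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2} - \lambda x\right)\mathbb{1}[x \ge 0] \\
 &\propto \exp\!\left(-\frac{x^2 - 2(\mu - \lambda\sigma^2)\,x}{2\sigma^2}\right)\mathbb{1}[x \ge 0] \\
 &\propto \exp\!\left(-\frac{\bigl(x - (\mu - \lambda\sigma^2)\bigr)^2}{2\sigma^2}\right)\mathbb{1}[x \ge 0].
\end{align*}
```

The product is therefore a Normal density with location $\mu - \lambda\sigma^2$ and unchanged variance, restricted to the non-negative reals, which can be sampled from exactly even though the pair is not conjugate.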

If we consider each of these Bayesian methods and their maximum likelihood analogue, it
becomes clear that there is a tradeoff between (i) representational flexibility of the model,
(ii) asymptotically exact vs. approximate inference, and (iii) human effort in designing the
training algorithm. In contrast, our adaptive Metropolis-Hastings approach (i) is representationally
flexible: every maximum likelihood variant of CMF we discuss has a Bayesian
analogue which can be trained using our approach; (ii) scales to relatively large data sets,
such as the fMRI problem we discuss in Chapter 5; and (iii) works for a variety of models:
one implementation allows us to train on many different combinations of
likelihood and prior. These properties are especially desirable when we consider sets of
related matrices. Conjugacy in the multiple-matrix setting for HB-CMF requires mutual
conjugacy between the priors and the likelihoods, restricting the combinations of distributions
we can use to model $X_{ij}$ and $Y_{jr}$. If an entity-type participates in many different
relations, the mutual conjugacy problem can be highly restrictive.
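
To make the mutual conjugacy requirement concrete, the following sketch covers the one case where it is easy to satisfy: a factor row shared by two matrices, both with Gaussian likelihoods, under a Gaussian prior. It assumes fully observed rows and known noise variances; the function name `shared_row_conditional` and all argument names are hypothetical. Replacing either likelihood with, say, a Bernoulli would break the closed form, which is the restriction discussed above.

```python
import numpy as np

def shared_row_conditional(x, V, var_x, y, Z, var_y, prior_mean, prior_cov):
    """Gaussian conditional of a factor row u shared by two relations.

    Assumed model: x ~ N(V u, var_x I), y ~ N(Z u, var_y I), u ~ N(prior_mean, prior_cov).
    Both likelihoods are Gaussian in u, so their precisions simply add to the
    prior precision and the conditional remains Gaussian.
    """
    prior_prec = np.linalg.inv(prior_cov)
    prec = prior_prec + V.T @ V / var_x + Z.T @ Z / var_y
    cov = np.linalg.inv(prec)
    mean = cov @ (prior_prec @ prior_mean + V.T @ x / var_x + Z.T @ y / var_y)
    return mean, cov

def gibbs_draw(mean, cov, rng):
    """One exact draw from the conditional, as a block Gibbs step would use."""
    return rng.multivariate_normal(mean, cov)
```

A draw is obtained with, for example, `gibbs_draw(mean, cov, np.random.default_rng(0))`.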

Discrete Component Analysis (DCA) [19] is an attempt to bring Bayesian matrix factorization
models into a unified view, much as we did for maximum likelihood low-rank
matrix factorizations in Chapter 3. DCA assumes that the prior and likelihood are conjugate,
but also considers the possibility that rows of the data matrix are draws from a
multinomial distribution:

$$X_{i\cdot} \sim \mathrm{Multinomial}(U_{i\cdot} V^T).$$

The multinomial rows assumption is shared with Latent Dirichlet Allocation [8]. In addition
to variational approximation and Gibbs sampling, the use of conjugate priors makes
Rao-Blackwellization within Gibbs practical. Rao-Blackwellization within Gibbs is based on the idea
that the unknown variables (in the case of HB-CMF, $\{\mathcal{F}, \Theta\}$) can be divided into two
groups, such that sampling from one given the other is analytically tractable. In the case
of HB-CMF, either $p(\mathcal{F} \mid \Theta)$ or $p(\Theta \mid \mathcal{F})$ would have to be analytic. We are not aware of
an analytic form for either distribution.

Hierarchical priors are common to many approaches to Bayesian matrix factorization.
They are essential to the design of a Bayesian analogue of ICA (cf. Buntine and Jakulin
[16]). Latent Dirichlet Allocation [8] and Multinomial PCA [17] utilize hierarchical priors
to pool information across documents. H-CMF, which is hierarchical but estimated under
maximum likelihood, generalizes unconstrained Probabilistic Matrix Factorization [108]
to non-Gaussian prediction links. The difference in our approach is that we make no
conjugacy assumptions, which allows HB-CMF to be easily generalized to include more
complex types of priors (e.g., Laplace priors, which encourage sparsity in the same manner
as $\ell_1$-regularization).
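
The link between a Laplace prior and an $\ell_1$ penalty follows from taking the negative logarithm of the prior. The short derivation below uses assumed notation: $u \in \mathbb{R}^k$ is one factor row, $b > 0$ is the Laplace scale, and $x$ stands for the data that depend on $u$.

```latex
\begin{align*}
p(u) &= \prod_{j=1}^{k} \frac{1}{2b}\exp\!\left(-\frac{|u_j|}{b}\right), \\
-\log p(u) &= \frac{1}{b}\,\lVert u \rVert_1 + k\log(2b), \\
\hat{u}_{\mathrm{MAP}} &= \arg\min_{u}\;\Bigl[-\log p(x \mid u) + \frac{1}{b}\,\lVert u \rVert_1\Bigr].
\end{align*}
```

The constant $k\log(2b)$ does not depend on $u$, so the MAP estimate under a Laplace prior coincides with an $\ell_1$-regularized maximum likelihood estimate with regularization weight $1/b$.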


6.3  Connections  to  multiple-matrix  factorization 

We do not claim to be the first to propose parameter tying in matrix factorization, though
we  believe  generalizing  E-PCA  to  multiple  matrices  is  far  more  flexible  than  techniques 
that  generalize  a  matrix  factorization  which  assumes  a  particular  response  type.  We  con¬ 
tend  that  Collective  Matrix  Factorization  and  its  variants  provide  a  flexible  way  of  mod¬ 
elling  sets  of  related  matrices,  in  a  manner  that  is  useful  for  information  integration. 

To  be  useful  for  information  integration,  we  need  to  be  able  to  scale  our  models  to 
handle  a  relatively  large  number  of  entities,  which  poses  an  algorithmic  problem.  A  great 
deal  of  this  thesis  is  concerned  with  algorithmic  contributions  that  make  Collective  Matrix 
Factorization  and  its  variants  practical  on  data  sets  with  large  numbers  of  entities. 

We  begin  by  disambiguating  our  work  from  relational  co-clustering,  which  assumes 
that  our  primary  concern  is  with  clustering  entities.  Later,  we  consider  multiple-matrix 
factorization  models  that  we  consider  more  closely  related  to  Collective  Matrix  Factoriza¬ 
tion. 


6.3.1  Relational  Co-Clustering 

Relational  (co-)clustering  refers  to  models  on  sets  of  related  matrices  or  tensors  where  the 
underlying  factorization  involves  clustering  or  co-clustering  of  the  rows  and  columns  (or 
higher  dimensions,  in  the  case  of  tensors).  This  thesis  focuses  on  predicting  the  value  of 
a  relation,  not  clustering  entities.  There  is  no  prima  facie  reason  to  favour  clustering  over 
factor  analysis  in  prediction;  but  there  are  computational  reasons  to  favour  factor  analysis 
over  clustering.  We  view  the  relational  co-clustering  techniques  below  as  complementary 
to  Collective  Matrix  Factorization  (i.e.,  focused  on  entity  clustering  in  relational  data,  as 
opposed to relation prediction):

>  Multi-type relational clustering: A symmetric block model $X^{(ij)} \approx C_i A^{(ij)} C_j^T$, where
$C_i \in \{0,1\}^{n_i \times k}$ and $C_j \in \{0,1\}^{n_j \times k}$ are cluster indicator matrices, and $A^{(ij)} \in \mathbb{R}^{k \times k}$
contains the predicted output for each combination of row and column clusters.

Early  work  on  this  model  uses  a  spectral  relaxation  specific  to  squared  loss  [70],  while 
later  generalizations  to  regular  exponential  families  [72]  use  EM.  An  equivalent  formu¬ 
lation  in  terms  of  regular  Bregman  divergences  [71]  uses  iterative  majorization  [64,  123] 
as  the  inner  loop  of  alternating  projection. 

>  Relational multi-way clustering [6]: A generalization of Bregman co-clustering of a
matrix to sets of related tensors. Each entity-type has a low-rank representation, and
an alternating optimization approach is used to determine a hard or soft clustering for
each entity-type, cyclically until convergence. Instead of a Bregman projection, the basic
operation is a clustering of the entities under Bregman information, the expected Bregman
divergence. Essentially, the clustering of entities is a generalization of hard or soft
k-means clustering, where Euclidean distance can be replaced by a variety of information
theoretic quantities, such as the mutual information between entities and cluster
exemplars. The entity clustering is a point estimate. In our work, the basic projection
operation is a convex optimization instead of a cluster assignment problem. Unlike our
work, Banerjee et al. [6] consider sets of tied tensors, allowing for higher-arity relations.

>  Extensions of pLSI and LDA: In the text modeling literature, multiple matrix factorizations
have centered on the assumption that there is, at the core, a matrix representing the
relationship Count(document, word), where each row of the corresponding data matrix
is modelled as a multinomial distribution over words. A multinomial row assumption
is true of both probabilistic Latent Semantic Indexing [52] and Latent Dirichlet
Allocation [8]. pLSI-pHITS [24] augments the Count relationship with a citation relation
between documents, Cites(document, document). The data reduces to a pair of
related matrices, each of which is modelled by pLSI: the distribution of topics for each
document is shared. We compared pLSI-pHITS against our approach in Section 4.5.3.

pLSI  and  its  multi-matrix  extensions  suffer  from  the  fold-in  problem  (Section  3.9). 
Bayesian  alternatives  like  LDA  do  not  suffer  from  the  fold-in  problem.  Recently,  there 
have  been  several  extensions  of  Latent  Dirichlet  Allocation  which  augmented  the  bag- 
of-words  representation  with  side  information,  either  by  tying  hyperparameters  together 
(e.g.,  [87]),  or  by  tying  parameters  to  a  matrix  factorization  like  E-PCA  [21].  All  of 
these  techniques  assume  that  the  primary  source  of  data  is  the  Count  relation,  which  is 
well-modelled  by  the  multinomial  link;  we  do  not. 
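
The block model used in multi-type relational clustering above can be sketched for a single relation under squared loss. The sketch assumes hard cluster labels are already given and only shows how the block matrix of predicted outputs follows from them. The names `indicator`, `block_means` and `reconstruct` are illustrative, and the cited methods alternate this step with re-estimating the clusters, which is not shown.

```python
import numpy as np

def indicator(labels, k):
    """Build an n x k cluster indicator matrix from integer labels in 0..k-1."""
    C = np.zeros((len(labels), k))
    C[np.arange(len(labels)), labels] = 1.0
    return C

def block_means(X, Ci, Cj):
    """Mean of X within each (row cluster, column cluster) block.

    Under squared loss the block mean is the best constant prediction for
    every entry that falls in that block.
    """
    sums = Ci.T @ X @ Cj
    counts = np.outer(Ci.sum(axis=0), Cj.sum(axis=0))
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

def reconstruct(Ci, A, Cj):
    """Predicted matrix: every entry takes the value of its block."""
    return Ci @ A @ Cj.T
```

Every entity in a cluster receives exactly the same prediction, in contrast with factor analysis, where each entity has its own real-valued low-rank representation.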

Table 6.1: Comparison of single- and multiple-matrix factorization models which model
each matrix entry as a draw from a regular exponential family. We refer the reader to
Section 6.3.2 for a description of the columns.

Algorithm      #M   Bayes?   Prior/Reg       Hier   Exp   Optimizer
PMF [108]      1    MLE      Gaussian (SC)   ✓      ✓     Gradient descent
E-PCA [26]     1    MLE      None            ✗      ✓     Alt. Bregman
BMF [109]      1    Bayes    Gaussian        ✓      ✗     Gibbs sampling
BXPCA [84]     1    Bayes    Multiple        ✓      ✓     Hybrid MC
VBSVD [67]     1    Bayes    Gaussian (DC)   ✓      ✗     Variational Bayes
SoRec [73]     2    MLE      Gaussian (SC)   ✗      ✗     Gradient descent
PMDC [132]     2+   Bayes    Gaussian (DC)   ✓      ✗     Variational Bayes
MRMF [68]      2+   MLE      Gaussian (SC)   ✗      ✗     Gradient descent
CMF [116]      2+   MLE      Gaussian (SC)   ✗      ✓     Alternating Newton
H-CMF          2+   MLE      Gaussian        ✓      ✓     Alternating Newton
HB-CMF         2+   Bayes    Gaussian        ✓      ✓     Adaptive MCMC

6.3.2  Related  Factor  Analysis  Models 

The most closely related single- and multiple-matrix factorization models are those that
do not make a clustering or co-clustering assumption, but do assume that each entry of
the data matrix can be modelled as a draw from an exponential family distribution. We
summarize these models in
Table 6.1. We compare the number of matrices involved (#M), whether or not the method
is Bayesian (Bayes?), the type of prior (Prior/Reg), whether the prior is learned automatically
(Hier), whether or not the entries can be modelled using any one-parameter exponential
family (Exp), and the optimization technique used (Optimizer). If the prior is
Gaussian, we note whether the covariance is spherical (SC) or diagonal (DC). If neither
SC nor DC is annotated, the form of the prior covariance is unconstrained. It should be
noted that even small differences in the features we consider can dramatically increase the
difficulty of learning.

The  three  variants  of  collective  matrix  factorization  are  the  most  flexible.  We  cover 
both  maximum  likelihood  and  Bayesian  inference,  automatically  estimated  hierarchical 
priors  and  flat  ones,  as  well  as  Gaussian  and  non-Gaussian  data  distributions.  The  only 
other  technique  which  takes  advantage  of  row-column  exchangeability  to  compute  per- 
row  Hessians  is  variational  Bayes  (used  by  VBSVD,  PMDC).  In  VBSVD  and  PMDC, 
the  quality  of  the  variational  Bayes  approximation  to  the  posterior  becomes  better  as  the 
posterior over each factor row becomes more Gaussian, and as the factor rows become
more  independent. 

The  closest  competing  approach  to  our  Bayesian  model  is  PMDC,  which  is  restricted 
to  modeling  entries  of  a  matrix  with  a  Gaussian.  The  variational  Bayesian  solution  is  based 
on  approximating  the  posterior  over  each  row  of  a  factor  as  a  multivariate  Gaussian.  Our 
approach  similarly  benefits  from  using  a  Gaussian  distribution  (as  a  proposal  distribution) 
without  introducing  the  biases  that  variational  methods  can  introduce.  Moreover,  when¬ 
ever  the  exponential  family  used  to  model  matrix  entries  changes,  a  new  variational  bound 
must  be  derived.  Our  approach  applies  to  a  wide  variety  of  exponential  families  without 
change. 

Our  decomposition  approach  is  greatly  informed  by  BMF:  we  sample  the  hyperparam¬ 
eters  using  the  same  Gibbs  step.  However,  we  can  model  sets  of  related  matrices  without 
requiring  that  the  predictions  be  Gaussian.  The  Hybrid  Monte  Carlo  sampler  in  BXPCA 
uses  the  gradient  of  the  likelihood;  we  use  the  gradient  and  the  per-row  Hessians. 

The three-factor schema $\mathcal{E}_1 \sim \mathcal{E}_2 \sim \mathcal{E}_3$ also includes supervised matrix factorization.
In this problem, the goal is to classify entities of type $\mathcal{E}_2$: matrix $X^{(12)}$ contains class labels
according to one or more related concepts (one concept per row), while $X^{(23)}$ lists
the features of each entity. An example of a supervised matrix factorization algorithm is
the support vector decomposition machine [95]: in SVDMs, the features $X^{(23)}$ are factored
under squared loss, while the labels $X^{(12)}$ are factored under hinge loss. Supervised
dimensionality reduction [106] replaces the hinge and squared losses with Bregman divergences,
training the model parameters using EM. A similar model was proposed by
Zhu et al. [135], using a once-differentiable variant of the hinge loss. Another example is
supervised LSI [133], which factors both the data and label matrices under squared loss,
with an orthogonality constraint on the shared factors. Principal components analysis,
which factors a doubly centered matrix under squared loss, has also been extended to the
three-factor schema [134].

Recursive Attribute Factoring [25] is unique in its approach to multiple-matrix factorization
(where one matrix consists of document-word counts, and another contains
document-document links). Instead of simultaneously finding a low-rank representation of
documents and words, they recursively factor the matrices: each low-rank model of documents
is appended to the document-word counts, and the augmented matrix becomes the
data for another round of factorization. Unlike our work, they factor exclusively under
squared loss.

In Chapter 3 we discussed max-margin matrix factorization (MMMF) as a single-matrix
factorization, where the matrix is the Rating(user, movie) relation. However,
it can also be viewed as a kind of collective matrix factorization. The key intuition
behind MMMF is that it treats a rating as an ordinal value: Rating $\in \{1, 2, \ldots, R\}$
and $1 < 2 < \ldots < R$. Fast MMMF reduces the prediction of an entry to a set of binary
threshold problems: namely, if $r_{ij} = \mathrm{Rating}(\mathrm{user}_i, \mathrm{movie}_j)$, predicting $r_{ij} > 1$, $r_{ij} > 2$,
$\ldots$, $r_{ij} > R - 1$. If we use a smooth loss (such as log-loss) for each of these binary
predictions and add the losses together, the result is equivalent to a collective matrix factorization
where $\mathcal{E}_1$ are users, $\mathcal{E}_2$ are movies, and $\mathcal{E}_1 \sim_u \mathcal{E}_2$ for $u = 1 \ldots R - 1$ are the binary
rating prediction data: $X^{(12,u)}_{pq} = 1$ iff user $p$ assigns a rating greater than $u$ to movie $q$. A
single matrix of data weights can be used to mask missing values in all $R - 1$ relations.
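
The reduction of ordinal ratings to binary threshold relations can be sketched as follows. The sketch assumes ratings are stored as integers in 1..R with 0 marking a missing entry; that encoding and the names `threshold_relations` and `decode` are assumptions made for illustration.

```python
import numpy as np

def threshold_relations(ratings, num_levels):
    """Split an ordinal rating matrix into R - 1 binary relations.

    ratings: integer array with values 1..num_levels, and 0 for missing.
    Returns the list of binary matrices [rating > 1, ..., rating > R - 1]
    and one weight matrix that masks the missing entries in all of them.
    """
    weights = (ratings > 0).astype(float)
    relations = [(ratings > u).astype(float) for u in range(1, num_levels)]
    return relations, weights

def decode(relations):
    """Recover a rating as 1 plus the number of thresholds it exceeds."""
    return 1 + sum(relations)
```

Entries where the weight is zero carry no information in any of the binary relations, so a single weight matrix suffices for all of them.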

In order to predict different values for the $R - 1$ different relations, we need to allow
the latent factors $U^{(1)}$ and $U^{(2)}$ to contain some untied columns, i.e., columns which are
not shared among relations. For example, the MMMF authors have suggested adding a
bias term for each rating level or for each (user, rating level) pair. To get a bias for each
(user, rating level) pair, we can append $R - 1$ untied columns to $U^{(1)}$, and have each of
these columns multiply a fixed column of ones in $U^{(2)}$. To get a shared bias for each rating
level, we can do the same, but constrain each of the untied columns in $U^{(1)}$ to be a multiple
of the all-ones vector.
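
One way to realise the untied bias columns is sketched below. It assumes the shared user and movie factors are `U` and `V`, that the per-(user, rating level) biases are stored in a hypothetical matrix `B` with one column per threshold, and that the relation for a given threshold pairs its bias column with a column of ones on the movie side. This is an illustration of the construction, not the thesis implementation.

```python
import numpy as np

def augmented_factors(U, V, B, level):
    """Append bias columns to U and a matching ones column to V.

    U: n_users x k shared factor, V: n_movies x k shared factor,
    B: n_users x (R - 1) untied bias columns, level: threshold index (0-based).
    Only the slot for `level` is set to ones, so the other bias columns
    contribute nothing to this relation.
    """
    ones_slot = np.zeros((V.shape[0], B.shape[1]))
    ones_slot[:, level] = 1.0
    return np.hstack([U, B]), np.hstack([V, ones_slot])

def logits(U, V, B, level):
    """Natural parameters for the binary relation at the given threshold."""
    U_aug, V_aug = augmented_factors(U, V, B, level)
    return U_aug @ V_aug.T
```

The product equals `U @ V.T` plus column `level` of `B` added to every movie. Constraining that column to be a multiple of the all-ones vector gives a single bias shared by all users at that rating level.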


6.4  Conclusions 

The  role  we  envision  for  collective  matrix  factorization  is  that  of  a  statistical  design  pattern 
for  information  integration,  a  class  of  graphical  models  where  the  complexities  of  training 
and  inference  are  hidden,  exposing  the  user  only  to  sets  of  related  matrices,  with  a  small 
number  of  tunable  parameters  (such  as  k,  the  number  of  embedding  dimensions). 

Other  approaches  to  statistical  relational  learning  provide  a  high-level  interface  for  cre¬ 
ating  much  larger  classes  of  graphical  models,  albeit  at  the  cost  of  losing  computationally 
useful  structure.  That,  combined  with  the  fact  that  we  build  upon  matrix  factorization, 
makes  it  natural  to  compare  our  work  to  other  matrix  factorization  methods.  Our  basic 
claim  is  that  even  though  sets  of  related  matrices  are  a  simple  representation,  and  tying 
low-rank matrix factorizations is a straightforward idea, the combination of these ideas
covers a fairly rich class of successful models in the machine learning literature.


Chapter 7

Future  Work  and  Discussion 


We  put  before  the  reader  the  claim  that  Collective  Matrix  Factorization,  and  its  variants,  are 
a  statistical  design  pattern  for  information  integration  problems.  In  Section  7.1  we  describe 
potential  extensions  to  Collective  Matrix  Factorization,  that  speak  to  the  limitations  and 
ambitions  felt  most  acutely  during  our  research  in  this  area.  We  have  already  described 
the  main  contributions  of  this  thesis  in  Chapter  1,  and  so  instead  close  with  a  discussion 
of  why  we  believe  these  contributions  are  of  significance. 


7.1  Future  Work 

7.1.1  Higher  Arity  Relations 

The  most  obvious  limitation  of  our  approach  is  that  each  relation  must  have  exactly  two 
arguments,  which  allows  for  the  encoding  of  attributes  and  typed  links,  but  not  relations 
with three or more arguments. Banerjee et al. [6] have considered the problem in the context
of relational clustering, though using point estimation. Alternating projection algorithms
have long been used in tensor factorization, and we believe that extending our work to
tensors would be relatively straightforward.


7.1.2  Distributed  Models 

We  have  discussed  the  parallel  implementation  of  a  projection,  but  Collective  Matrix  Fac¬ 
torization  could  also  be  reasonably  adapted  to  the  distributed  setting,  where  the  data  and 
parameters are split up across multiple machines.1 All we require to update the low-rank
parameters  for  an  entity  are  the  current  values  of  the  low-rank  parameters,  the  observed 
relations  the  entity  is  involved  in,  and  the  factors  that  represent  the  other  argument  in 
relations  the  entity  is  involved  in.2  Entities,  and  their  low-rank  representations,  can  be  dis¬ 
tributed  across  multiple  machines,  along  with  copies  of  the  necessary  rows  and/or  columns 
of  the  data  matrices.  All  one  need  ever  communicate  between  machines  are  factor  rows.  It 
remains  to  be  shown,  though,  how  one  can  efficiently  distribute  data  and  parameters  across 
multiple  machines,  so  that  the  computational  and  communication  costs  of  training  are  not 
focused  on  a  small  subset  of  the  machines. 
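
The data requirements for a single entity update can be made concrete with a small sketch. It assumes squared loss with a ridge penalty so that the update has a closed form; `update_entity_row` and its arguments are hypothetical names. The point is only that the update reads the entity's observed entries and the matching rows of the other factor, nothing else.

```python
import numpy as np

def update_entity_row(x_row, w_row, V, ridge=1e-3):
    """Update one entity's factor row from its observed relations only.

    x_row: the entity's row of the data matrix; w_row: 0/1 weights marking
    which entries are observed; V: the factor for the other argument.
    Returns the new factor row and the indices of the rows of V that were
    needed, i.e. the rows another machine would have to send.
    """
    observed = np.flatnonzero(w_row)
    V_obs = V[observed]                       # only these rows are read
    k = V.shape[1]
    gram = V_obs.T @ V_obs + ridge * np.eye(k)
    u = np.linalg.solve(gram, V_obs.T @ x_row[observed])
    return u, observed
```

The returned index set gives the communication cost of the update for that entity: an entity involved in many observed relations needs many remote factor rows, which is one source of the load imbalance mentioned above.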


7.1.3  Conditional  Models 

We  have  concerned  ourselves  with  generative  models,  where  we  model  the  joint  distribu¬ 
tion  of  the  parameters,  hyperparameters,  and  all  data  matrices.  If  we  are  interested  only 
in  predicting  the  value  of  one  relation,  then  introducing  parameters  to  model  the  other 
relations  can  be  both  statistically  and  computationally  inefficient.  One  way  of  addressing 
these  concerns  may  be  through  the  use  of  discriminative  models:  instead  of  modeling  the 
joint distribution of the inputs and outputs, model the conditional distribution of the outputs
given the inputs. In the two matrix case, the task would involve modelling $p(X \mid Y)$
or $p(Y \mid X)$ instead of $p(X, Y)$. The application domains we have considered lend themselves
to discriminative models. In the fMRI domain, we may be most interested in predicting
voxel activation given word co-occurrence data. In the movie rating domain, we
may be most interested in predicting users' ratings given side information about movies.

The  closest  related  work  on  discriminative  models  is  based  on  Canonical  Correlation 
Analysis  (CCA)  [53],  which  views  the  data  matrix  as  a  collection  of  exchangeable  rows  or 
columns.  Given  two  collections  of  related  data  points  (rows  or  columns),  we  can  imagine  a 
regression model, for each data set, that projects each set of points onto an unknown scalar
quantity, which we call a canonical variate. Canonical Correlation Analysis involves finding
regression parameters for each data set such that the correlation between the canonical
variates  is  maximized.  If  we  draw  out  CCA  as  a  graphical  model,  as  in  Bach  and  Jordan 
[4], it becomes clear that the regression weights are a latent representation for exchangeable
data, and that the regression weights are tied in a specific fashion. A generalization
of CCA to multiple related matrices has been used to integrate data from multiple fMRI
studies, improving the quality of a linear regression model of brain activation [107].

1 We note that timing experiments for the hierarchical Bayesian model (Chapter 5) are based on a parallel
implementation using four processors in a shared memory configuration. However, we have not considered
the distributed setting, where communication constraints between processors are much more of a concern.

2 Technically, all that is required are the rows of the other factor that correspond to observed entries in a
data matrix.

The multiple CCA approach assumes a star schema: a single common entity type $\mathcal{E}_1$,
which is related to the other entity types: $\mathcal{E}_1 \sim \mathcal{E}_i$, $\forall i > 1$. When the schema is not a star,
conditional relational models become problematic. Conditional distributions usually only
include parameters between the output (relation we want to predict) and the inputs (other
relations). Parameters are not included to model the relations between the input relations.
One potential direction for defining a conditional Collective Matrix Factorization model
would be to keep the existing generative structure, but learn the parameters discriminatively
(e.g., by maximizing a conditional likelihood instead of the likelihood). However,
maximizing conditional likelihood does not have a straightforward Bayesian analogue.

Another  problematic  aspect  of  conditional  models  for  matrix  data  is  that  they  assume 
exchangeability  of  each  matrix,  but  not  row-column  exchangeability.  Developing  condi¬ 
tional  models  on  row-column  exchangeable  models  remains  a  topic  for  future  work. 
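
The two-view Canonical Correlation Analysis objective described in this section can be sketched with a standard whitening-plus-SVD computation. The sketch assumes two data matrices with matching rows and computes only the first pair of regression weights; `first_canonical_pair` and the small `reg` term added for numerical stability are illustrative choices, not part of the cited work.

```python
import numpy as np

def first_canonical_pair(X, Y, reg=1e-8):
    """Weights wx, wy maximising the correlation between X @ wx and Y @ wy.

    X: n x dx and Y: n x dy, with row i of each describing the same data point.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    # Whiten each view with its Cholesky factor: M = Lx^{-1} Sxy Ly^{-T}.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    M = np.linalg.solve(Ly, np.linalg.solve(Lx, Sxy).T).T
    A, s, Bt = np.linalg.svd(M)
    # Map the top singular vectors back to the original coordinates.
    wx = np.linalg.solve(Lx.T, A[:, 0])
    wy = np.linalg.solve(Ly.T, Bt[0])
    return wx, wy, float(s[0])
```

The largest singular value is the first canonical correlation, and the projections `(X - X.mean(axis=0)) @ wx` and `(Y - Y.mean(axis=0)) @ wy` are the corresponding canonical variates.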


7.1.4  Relational  Active  Learning 

Information  integration  involves  aggregating  disparate  sources  of  information,  some  of 
which  are  easier  to  acquire  than  others.  In  the  augmented  collaborative  filtering  problem, 
collecting  side  information  about  movies  is  easy  and  inexpensive;  collecting  ratings  from 
people  is  time  consuming.  Large-scale  rating  data,  such  as  the  Netflix  data  set  [89],  is  the 
side  product  of  a  company’s  business.  In  the  augmented  brain  imaging  problem,  word  co¬ 
occurrence  data  is  cheap  to  collect;  brain  imaging  data  is  expensive  to  collect.  fMRI  data 
collection requires human subjects, access to an fMRI device, and domain experts to process
the  data.  While  Collective  Matrix  Factorization  can  improve  predictions  on  models  of 
expensive  data,  can  we  use  our  models  to  guide  the  process  of  collecting  expensive  data? 

Active  learning  is  a  formalism  where  the  learning  algorithm  can  select  training  obser¬ 
vations,  instead  of  passively  using  data  drawn  from  a  fixed,  unknown,  probability  distri¬ 
bution  (a.k.a.  passive  learning).  The  advantage  of  active  learning  over  passive  learning, 
which  has  been  the  focus  of  this  thesis,  is  that  the  learning  algorithm  can  select  obser¬ 
vations  which  are  more  likely  to  improve  predictive  accuracy.  Active  learning  is  usually 
framed  as  a  value  of  information  problem.  The  learning  algorithm  has  a  function  which 
can assign a value to every possible subset of observations.3 The goal is to find the best
possible subset of observations.

3 Often, value of information functions are based on an information theoretic property, such as mutual
information (cf. Guestrin et al. [46], Krause et al. [60]). Since Collective Matrix Factorization has been
framed as a probabilistic graphical model, such information theoretic functions are easy to define.

There  are  a  number  of  questions  that  arise  when  one  considers  active  learning  within  a 
relational  model:  Can  we  exploit  relations  between  expensive  and  inexpensive  data  sources 
to  reduce  the  cost  of  data  collection,  while  still  producing  highly  predictive  models?  How 
do  relations  guide  the  process  of  selecting  maximally  informative  objects?  What  is  the 
appropriate  tradeoff  between  observing  properties  of  an  object  and  the  relations  it  partic¬ 
ipates  in?  What  is  the  relative  difficulty  of  this  problem  in  the  adaptive  scenario,  where 
the  model  incorporates  data  as  it  is  selected,  versus  the  one-shot  scenario,  where  we  must 
choose  all  observations  to  make  before  seeing  any  of  their  results? 

7.1.5  Temporal-Relational  Models 

Collective  Matrix  Factorization  assumes  that  observations  are  scalars: 

Relat ion(x, y)  G  M. 

In  many  cases,  observations  are  better  represented  as  time-varying  quantities: 

$\mathrm{Relation}(x, y) = (v_t)_{t=1}^{T}, \quad v_t \in \mathbb{R}$.

For example, while we have treated fMRI measures of voxel activation as scalar quantities, there are other techniques for measuring brain activity at high frequency, such as magnetoencephalography (MEG). The voxel response relation for MEG data would be better described by Response(stimulus, voxel): each observation is a time series.

Were one to replace each random variable that models a scalar observation in Collective Matrix Factorization with a time series, similar to a Hidden Markov Model or Kalman
Filter,  the  end  result  would  be  to  tie  the  hidden  state  and/or  dynamics  of  time  series  models 
together.  Predictions  of  an  observed  relation  (time  series)  would  depend  not  only  on  its 
own  past,  but  also  on  the  past  of  other,  correlated,  temporal  observations. 

One  could  imagine  the  utility  of  alternating  optimization  approaches,  where  instead  of 
projection  the  basic  operation  is  training  a  time  series  model.  Such  an  approach  would 
also  provide  an  alternative  to  modeling  time  as  a  dimension  in  matrix  factorization,  which 
violates  the  row-column  exchangeability  assumption. 


7.2  Why  this  thesis  matters 

Statistical relational learning is a vast topic, perhaps too vast. Rarely does one see researchers talking about propositional or attribute-value learning as a monolithic field of study. Rather, we typically refine our discussion of attribute-value learning along different
axes:  by  task  abstraction  (clustering,  prediction);  by  learning  abstraction  (batch,  online, 
active);  and  by  application  domains  (in  natural  language  processing,  in  robotics).  It  is  our 
belief that a similar refinement of statistical relational learning is valuable; information integration is a refinement of relational learning by task abstraction. Refining our discussion
to  focus  on  information  integration  is  valuable  because  (i)  such  problems  appear  in  many 
different  domains;  (ii)  we  have  developed  easy-to-use  techniques  that  address  information 
integration,  or  provide  the  basis  for  a  more  elaborate  solution;  (iii)  while  easy-to-use,  our 
techniques  are  backed  by  sophisticated,  scalable  algorithms  for  learning  and  inference. 

From a statistical perspective, the goal of relational learning is ambitious. The vast majority of statistics and machine learning holds exchangeability of data points as a paradigmatic assumption. However, the whole purpose of relational learning is to exploit correlations due to violations of exchangeability. We have shown that one can have a relational
model  that  exploits  exchangeability  within  a  relation  and  correlations  between  relations. 

Even  though  relational  data  violates  some  of  the  most  basic  assumptions  of  common 
statistical  models,  we  can  build  elaborate  models  of  relational  data  from  very  simple,  very 
familiar,  building  blocks.  Sets  of  related  matrices  can  represent  a  large  swath  of  relational 
data  sets,  and  this  representation  is  not  especially  complicated.  Given  a  set  of  related 
matrices, tying low-rank parameters together is not an especially complicated idea. Given
a  large,  non-convex  optimization,  alternating  projections  (a.k.a.  block  coordinate  descent) 
is  not  an  especially  elaborate  solver.  Given  a  projection,  recognizing  independence  among 
subsets  of  random  variables  is  not  unheard  of.  And  finally,  we  have  in  our  hands  the  basic 
operation:  train  a  linear  model,  in  our  case,  a  generalized  linear  model,  or  its  Bayesian 
analogue.  Now,  if  there  is  one  class  of  models  we  understand  well,  and  know  how  to 
scale,  it  is  linear  models. 

While  the  individual  steps  above  are  simple,  we  believe  that  our  combination  leads  to 
an  approach  more  powerful  than  is  suggested  by  the  sum  of  parts.  Keeping  things  simple 
does have its advantages.
Appendix A


Newton  Update  for  the  Three 
Entity-Type  Model 

A.1 Derivation of the 3-entity-type model

The 3-entity-type model consists of two relationship matrices, an $m \times n$ matrix $X$ and an $n \times r$ matrix $Y$. In the spirit of generalized linear models we define the low rank representations of the relationship matrices to be $X \approx f_1(UV^T)$ and $Y \approx f_2(VZ^T)$, where $f_1 : \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$ and $f_2 : \mathbb{R}^{n \times r} \to \mathbb{R}^{n \times r}$ are the prediction links1, and $U \in \mathbb{R}^{m \times k}$, $V \in \mathbb{R}^{n \times k}$, and $Z \in \mathbb{R}^{r \times k}$ are the parameters of the model for $k \in \mathbb{Z}_{++}$, a positive integer. We further define functions $G : \mathbb{R}^{m \times k} \to \mathbb{R}^{m \times k}$, $H : \mathbb{R}^{n \times k} \to \mathbb{R}^{n \times k}$, $I : \mathbb{R}^{r \times k} \to \mathbb{R}^{r \times k}$, to model prior knowledge about our parameters, e.g., regularizers. We additionally require the convex conjugate,

\[ G^*(U) = \sup_{A \in \operatorname{dom}(G)} \left[ U \circ A - G(A) \right], \]

where $U \circ A = \operatorname{tr}(U^T A) = \sum_{ij} U_{ij} A_{ij}$ is the matrix dot product.

1 Throughout this appendix we assume that prediction links $f$ and the corresponding loss defined by $F$ are decomposable, i.e., $F : \mathbb{R} \to \mathbb{R}$ and $f : \mathbb{R} \to \mathbb{R}$ are applied element-wise when their arguments are matrices.

The overall loss function for our model is

\[ L(U, V, Z \mid W, \tilde{W}) = \alpha L_1(U, V \mid W) + (1 - \alpha) L_2(V, Z \mid \tilde{W}) \tag{A.1} \]

where we introduce fixed weight matrices for the observations, $W \in \mathbb{R}^{m \times n}$ and $\tilde{W} \in \mathbb{R}^{n \times r}$. The individual objectives on the reconstruction of $X$ and $Y$ are, respectively,


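\[ L_1(U, V \mid W) = \left( W \circ F_1(UV^T) - (W \odot X) \circ UV^T \right) + G^*(U) + H^*(V) \tag{A.2} \]

\[ L_2(V, Z \mid \tilde{W}) = \left( \tilde{W} \circ F_2(VZ^T) - (\tilde{W} \odot Y) \circ VZ^T \right) + H^*(V) + I^*(Z) \tag{A.3} \]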


where $F_1$ and $F_2$ are element-wise functions over matrices, i.e., $L_1$ and $L_2$ are each decomposable. The scalars $a, b, c > 0$ control the strength of regularization on the factors (the larger the value, the weaker the regularizer). The objective $L(U, V, Z \mid W, \tilde{W})$ is convex in any one of its arguments, but is in general non-convex in all its arguments jointly. As such, we use an alternating minimization scheme that optimizes one factor $U$, $V$, or $Z$ at a time. This appendix describes the derivation of both gradient and Newton update rules for $U$, $V$, and $Z$. For completeness, Section A.1.1 reviews useful definitions from matrix calculus. The gradient of the objective with respect to each argument is derived in Section A.1.2. Finally, by assuming the loss is decomposable, we derive the Newton update in Section A.1.3, whose cost is essentially a factor of $k$ greater than that of its gradient analogue.

A.1.1  Matrix  Calculus 

For  the  sake  of  completeness  we  define  matrix  derivatives,  which  generalize  both  scalar 
and  vector  derivatives.  Using  this  definition  of  matrix  derivatives,  we  also  generalize  the 
scalar chain and product rules to matrices. The discussion herein is based on Magnus et al. [75, 76].

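Let $M$ be an $n \times q$ matrix of variables, where $m_{\cdot j}$ denotes the $j$-th column of $M$. The vec-operator yields an $nq \times 1$ matrix that stacks the columns of $M$:

\[ \operatorname{vec} M = \begin{pmatrix} m_{\cdot 1} \\ m_{\cdot 2} \\ \vdots \\ m_{\cdot q} \end{pmatrix}. \]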
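While there are several common (and incompatible) definitions of matrix derivatives, the derivative of an $n \times 1$ vector $f$ with respect to an $m \times 1$ vector $x$ is almost universally defined as

\[ Df(x) = \frac{\partial f}{\partial x^T} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_m} \\ \vdots & & \vdots \\ \frac{\partial f_n}{\partial x_1} & \cdots & \frac{\partial f_n}{\partial x_m} \end{pmatrix}. \]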


The matrix derivative of an $m \times p$ matrix function $\varphi$ of an $n \times q$ matrix of variables $M$ contains $mnpq$ partial derivatives, and the matrix derivative arranges these partial derivatives into a matrix. We define the matrix derivative by coercing matrices into vectors, and using the above definition of the vector derivative:

\[ D\varphi(M) = \frac{\partial \operatorname{vec} \varphi(M)}{\partial (\operatorname{vec} M)^T} = \begin{pmatrix} \frac{\partial [\varphi(M)]_{11}}{\partial m_{11}} & \cdots & \frac{\partial [\varphi(M)]_{11}}{\partial m_{nq}} \\ \vdots & & \vdots \\ \frac{\partial [\varphi(M)]_{mp}}{\partial m_{11}} & \cdots & \frac{\partial [\varphi(M)]_{mp}}{\partial m_{nq}} \end{pmatrix}, \]

which is an $mp \times nq$ matrix of partial derivatives. This definition encompasses vector and scalar derivatives as special cases. The advantages of this formulation include (i) unambiguous definitions for the product and chain rules, and (ii) an easy conversion from $D\varphi(M)$ to the more common definition of the matrix derivative, $\partial \varphi / \partial M$, via the first identification theorem [76, ch. 9]. We additionally require the Kronecker product, and the matrix chain [76, pg. 121] and matrix product [75] rules:

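Definition 6 (Kronecker Product). Given an $m \times p$ matrix $Q$ and an $n \times q$ matrix $M$, the Kronecker product, $Q \otimes M$, is an $mn \times pq$ matrix:

\[ Q \otimes M = \begin{pmatrix} Q_{11} M & \cdots & Q_{1p} M \\ \vdots & & \vdots \\ Q_{m1} M & \cdots & Q_{mp} M \end{pmatrix}. \]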

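Definition 7 (Matrix Chain Rule). Given functions $\varphi : \mathbb{R}^{n \times q} \to \mathbb{R}^{m \times p}$, $\varphi_1 : \mathbb{R}^{\ell \times r} \to \mathbb{R}^{m \times p}$, and $\varphi_2 : \mathbb{R}^{n \times q} \to \mathbb{R}^{\ell \times r}$, the derivative of $\varphi(M) = \varphi_1(Y)$, where $Y = \varphi_2(M)$, is

\[ D\varphi(M) = D\varphi_1(Y) \times D\varphi_2(M), \]

where $A \times B$ (or equivalently, $AB$) is the matrix product.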

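Definition 8 (Matrix Product Rule). Given an $m \times p$ matrix $\varphi_1(M)$, a $p \times r$ matrix $\varphi_2(M)$, and an $\ell \times s$ matrix $M$, the derivative of $\varphi_1(M) \times \varphi_2(M)$ is

\[ D(\varphi_1 \times \varphi_2)(M) = (\varphi_2(M)^T \otimes I_m) \times D\varphi_1(M) + (I_r \otimes \varphi_1(M)) \times D\varphi_2(M), \]

where $A \otimes B$ is the Kronecker product.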
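The definitions above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper `vec`, the matrix sizes, and the random inputs are illustrative choices, not part of the original text. It applies the product rule to $\varphi(U) = UV^T$ with $V$ held constant, which gives $D\varphi(U) = V \otimes I_m$, and compares that against a finite-difference Jacobian built with the column-stacking vec-operator.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into a 1-D array (column-major order)."""
    return np.asarray(M).reshape(-1, order="F")

rng = np.random.default_rng(0)
m, n, k = 3, 4, 2
U = rng.standard_normal((m, k))
V = rng.standard_normal((n, k))

# Product rule with phi_1(U) = U and phi_2(U) = V^T (constant in U):
# the second term vanishes, leaving (V (x) I_m) times the identity.
D_analytic = np.kron(V, np.eye(m))           # shape (m*n, m*k)

# Finite-difference Jacobian: column j is d vec(U V^T) / d vec(U)_j.
eps = 1e-6
D_numeric = np.zeros((m * n, m * k))
for j in range(m * k):
    dU = np.zeros(m * k)
    dU[j] = eps
    U_pert = U + dU.reshape((m, k), order="F")
    D_numeric[:, j] = (vec(U_pert @ V.T) - vec(U @ V.T)) / eps

max_err = np.max(np.abs(D_analytic - D_numeric))
```

Because $UV^T$ is linear in $U$, the two Jacobians should agree up to floating-point rounding. Note that the column-major reshape (`order="F"`) is what makes `vec` match the column-stacking convention used here.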
A.1.2 Computing the Gradient


To compute the derivative of Equation A.1 with respect to $U$, $V$, and $Z$ we require the following three lemmas:


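Lemma 1. For any differentiable function $F : \mathbb{R} \to \mathbb{R}$

\[ \frac{\partial \left( W \circ F(UV^T) \right)}{\partial U} = \left( W \odot f(UV^T) \right) V, \qquad \frac{\partial \left( W \circ F(UV^T) \right)}{\partial V} = \left( W \odot f(UV^T) \right)^T U, \]

where $f = \nabla F$.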


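Proof. We only derive the result for $\varphi(U) = W \circ F(UV^T)$; the proof is similar for the other case. $\varphi(U)$ can be expressed as the composition of functions: $\varphi(U) = W \circ F(Y)$, $Y = \varphi_2(U)$, $\varphi_2(U) = UV^T$. We note that

\[
\begin{aligned}
DF(Y) &= \frac{\partial \operatorname{vec} W \circ F(UV^T)}{\partial (\operatorname{vec} UV^T)^T} \\
&= \begin{pmatrix} \frac{\partial (W \circ F(Y))}{\partial Y_{11}} & \cdots & \frac{\partial (W \circ F(Y))}{\partial Y_{mn}} \end{pmatrix} \\
&= \begin{pmatrix} W_{11} f(U_{1\cdot} V_{1\cdot}^T) & \cdots & W_{mn} f(U_{m\cdot} V_{n\cdot}^T) \end{pmatrix} \\
&= \left( \operatorname{vec} W \odot f(UV^T) \right)^T.
\end{aligned}
\tag{A.4}
\]

Furthermore, using the matrix product rule

\[
D\varphi_2(U) = (V \otimes I_m) \times \frac{\partial \operatorname{vec} U}{\partial (\operatorname{vec} U)^T} + (I_n \otimes U) \times \frac{\partial \operatorname{vec} V^T}{\partial (\operatorname{vec} U)^T} = (V \otimes I_m).
\tag{A.5}
\]

Combining equations A.4 and A.5 using the matrix chain rule and the identity $(\operatorname{vec} A)^T \times (B \otimes I) = (\operatorname{vec} AB)^T$ yields $D\varphi(U) = DF(Y) \times D\varphi_2(U) = \left( \operatorname{vec} \left( W \odot f(UV^T) \right) V \right)^T$. The first identification theorem [76, ch. 9], which formally states the rules for rearranging entries of $D\varphi(U)$ to get $\partial \varphi(U) / \partial U$, completes the proof. □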
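As a sanity check on Lemma 1, the sketch below compares both closed-form gradients with central finite differences. It assumes NumPy and, purely for illustration, takes $F(x) = \log(1 + e^x)$ so that $f$ is the logistic function; the function names and matrix sizes are hypothetical.

```python
import numpy as np

def F(x):
    """Example potential: log(1 + e^x), applied element-wise."""
    return np.logaddexp(0.0, x)

def f(x):
    """Derivative of F: the logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-x))

def objective(U, V, W):
    """W o F(U V^T): weighted sum of F over all entries."""
    return np.sum(W * F(U @ V.T))

def grad_U(U, V, W):
    """Lemma 1, derivative with respect to U."""
    return (W * f(U @ V.T)) @ V

def grad_V(U, V, W):
    """Lemma 1, derivative with respect to V."""
    return (W * f(U @ V.T)).T @ U

def numeric_grad(fun, M, eps=1e-6):
    """Central finite differences, one entry of M at a time."""
    G = np.zeros_like(M)
    for idx in np.ndindex(*M.shape):
        E = np.zeros_like(M)
        E[idx] = eps
        G[idx] = (fun(M + E) - fun(M - E)) / (2 * eps)
    return G

rng = np.random.default_rng(1)
U = rng.standard_normal((5, 2))
V = rng.standard_normal((4, 2))
W = rng.uniform(size=(5, 4))

err_U = np.max(np.abs(grad_U(U, V, W) - numeric_grad(lambda M: objective(M, V, W), U)))
err_V = np.max(np.abs(grad_V(U, V, W) - numeric_grad(lambda M: objective(U, M, W), V)))
```

Both errors should be small relative to the gradient entries. Any other differentiable $F$ with a known derivative could be substituted in the same way.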

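Lemma 2. For fixed matrices $W$ and $X$ and variable matrices $U$ and $V$

\[ \frac{\partial \left( (W \odot X) \circ UV^T \right)}{\partial U} = (W \odot X) V, \qquad \frac{\partial \left( (W \odot X) \circ UV^T \right)}{\partial V} = (W \odot X)^T U. \]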
Proof. We derive the result for $\partial \left( (W \odot X) \circ UV^T \right) / \partial V$; the other derivation is similar. To avoid a long digression into matrix differentials, we prove the result by element-wise differentiation. Noting that

\[ (W \odot X) \circ UV^T = \sum_i \sum_j W_{ji} X_{ji} \sum_k U_{jk} V_{ik}, \]

we compute the derivative with respect to $v_{pq}$:

\[ \frac{\partial \left( (W \odot X) \circ UV^T \right)}{\partial v_{pq}} = \sum_i \sum_j W_{ji} X_{ji} \frac{\partial}{\partial v_{pq}} \sum_k U_{jk} V_{ik} = \sum_j W_{jp} X_{jp} U_{jq} = \left( (W \odot X)^T U \right)_{pq}, \]

where the second equality follows from the fact that the derivative is zero unless $(i, k) = (p, q)$. Since the result holds for all $p \in \{1, \ldots, n\}$ and $q \in \{1, \ldots, k\}$ it follows that $\partial \left( (W \odot X) \circ UV^T \right) / \partial V = (W \odot X)^T U$. □


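Lemma 3. For the $\ell_2$ regularizers, where $a, b, c > 0$ control the strength of regularization (larger values $\Rightarrow$ weaker regularization):

\[ G(U) = \frac{a \|U\|_{Fro}^2}{2}, \qquad H(V) = \frac{b \|V\|_{Fro}^2}{2}, \qquad I(Z) = \frac{c \|Z\|_{Fro}^2}{2}, \]

the derivatives for the convex conjugates are

\[ \frac{\partial G^*(U)}{\partial U} = \frac{U}{a}, \qquad \frac{\partial H^*(V)}{\partial V} = \frac{V}{b}, \qquad \frac{\partial I^*(Z)}{\partial Z} = \frac{Z}{c}. \]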


Proof. The result is easily proven by finding the convex conjugate and differentiating it. □
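For readers who want the omitted step, the case of $G$ can be worked out as follows; this is a sketch of the standard computation rather than text from the original proof, and the cases of $H$ and $I$ are identical with $b$ and $c$ in place of $a$.

```latex
\begin{align*}
G^*(U) &= \sup_{A} \left[ U \circ A - \tfrac{a}{2}\, A \circ A \right]
  && \text{(definition of the conjugate, with } \|A\|_{Fro}^2 = A \circ A\text{)} \\
0 &= U - aA \;\Longrightarrow\; A = U / a
  && \text{(the bracket is concave in } A\text{, so stationarity gives the supremum)} \\
G^*(U) &= \frac{U \circ U}{a} - \frac{a}{2} \cdot \frac{U \circ U}{a^2}
        = \frac{\|U\|_{Fro}^2}{2a} \\
\frac{\partial G^*(U)}{\partial U} &= \frac{U}{a}.
\end{align*}
```

The factor $1/a$ in the conjugate is why larger values of $a$ correspond to weaker regularization.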


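Combining Lemmas 1, 2, and 3, and denoting the Hadamard (element-wise) product of matrices $Q \odot M$, we have that for $f_1 = \nabla F_1$ and $f_2 = \nabla F_2$

\[ \frac{\partial L(U, V, Z)}{\partial U} = \alpha \left( W \odot \left( f_1(UV^T) - X \right) \right) V + A, \tag{A.6} \]

\[ \frac{\partial L(U, V, Z)}{\partial V} = \alpha \left( W \odot \left( f_1(UV^T) - X \right) \right)^T U + (1 - \alpha) \left( \tilde{W} \odot \left( f_2(VZ^T) - Y \right) \right) Z + B, \tag{A.7} \]

\[ \frac{\partial L(U, V, Z)}{\partial Z} = (1 - \alpha) \left( \tilde{W} \odot \left( f_2(VZ^T) - Y \right) \right)^T V + C. \tag{A.8} \]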




Setting the gradient equal to zero yields update equations either for $A$, $B$, $C$ or for $U$, $V$, $Z$. An advantage of using a gradient update is that we can relax the requirement that the links are differentiable, replacing gradients with subgradients in the preceding derivation.
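The gradients in Equations A.6-A.8 can be sketched in a few lines of NumPy. Several things here are assumptions made for illustration and are not stated in the text: both links are logistic, the terms $A$, $B$, $C$ are taken to be the $\ell_2$ conjugate gradients $U/a$, $V/b$, $Z/c$ of Lemma 3, the second weight matrix is named `Wt`, and the objective used for checking counts each regularizer exactly once so that its gradient matches A.6-A.8 term for term.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def objective(U, V, Z, X, Y, W, Wt, alpha, a, b, c):
    """Weighted data terms plus one copy of each conjugate regularizer.

    Assumption: logistic links, i.e. F(x) = log(1 + e^x) for both relations.
    """
    T1, T2 = U @ V.T, V @ Z.T
    data1 = np.sum(W * (np.logaddexp(0.0, T1) - X * T1))
    data2 = np.sum(Wt * (np.logaddexp(0.0, T2) - Y * T2))
    reg = (np.sum(U**2) / a + np.sum(V**2) / b + np.sum(Z**2) / c) / 2.0
    return alpha * data1 + (1.0 - alpha) * data2 + reg

def gradients(U, V, Z, X, Y, W, Wt, alpha, a, b, c):
    """Equations A.6-A.8 with A = U/a, B = V/b, C = Z/c (assumed)."""
    R1 = W * (sigmoid(U @ V.T) - X)       # m x n weighted residual
    R2 = Wt * (sigmoid(V @ Z.T) - Y)      # n x r weighted residual
    gU = alpha * R1 @ V + U / a
    gV = alpha * R1.T @ U + (1.0 - alpha) * R2 @ Z + V / b
    gZ = (1.0 - alpha) * R2.T @ V + Z / c
    return gU, gV, gZ

rng = np.random.default_rng(2)
m, n, r, k = 6, 5, 4, 2
U, V, Z = (rng.standard_normal((d, k)) for d in (m, n, r))
X = (rng.uniform(size=(m, n)) < 0.5).astype(float)
Y = (rng.uniform(size=(n, r)) < 0.5).astype(float)
W, Wt = np.ones((m, n)), np.ones((n, r))
args = (X, Y, W, Wt, 0.5, 10.0, 10.0, 10.0)   # alpha, a, b, c

# Compare a directional derivative of the objective with the analytic one.
gU, gV, gZ = gradients(U, V, Z, *args)
dU, dV, dZ = (rng.standard_normal(M.shape) for M in (U, V, Z))
eps = 1e-6
numeric = (objective(U + eps * dU, V + eps * dV, Z + eps * dZ, *args)
           - objective(U - eps * dU, V - eps * dV, Z - eps * dZ, *args)) / (2 * eps)
analytic = np.sum(gU * dU) + np.sum(gV * dV) + np.sum(gZ * dZ)
```

A gradient step then subtracts a step length times `gU`, `gV`, or `gZ` from the corresponding factor. Setting entries of `W` or `Wt` to zero drops the matching observations from both the objective and the gradient.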


A.1.3  Computing  the  Newton  Update 

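One may be satisfied with a gradient step. However, the assumption of a decomposable loss means that most second derivatives are zero, and a Newton update can be done efficiently, reducing to row-wise optimization of $U$, $V$, and $Z$. For the subclass of models where Equations A.6-A.8 are differentiable and the loss is decomposable, define

\[ q(U_{i\cdot}) = \alpha \left( W_{i\cdot} \odot \left( f_1(U_{i\cdot} V^T) - X_{i\cdot} \right) \right) V + A_{i\cdot}, \]

\[ q(V_{i\cdot}) = \alpha \left( W_{\cdot i} \odot \left( f_1(U V_{i\cdot}^T) - X_{\cdot i} \right) \right)^T U + (1 - \alpha) \left( \tilde{W}_{i\cdot} \odot \left( f_2(V_{i\cdot} Z^T) - Y_{i\cdot} \right) \right) Z + B_{i\cdot}, \]

\[ q(Z_{i\cdot}) = (1 - \alpha) \left( \tilde{W}_{\cdot i} \odot \left( f_2(V Z_{i\cdot}^T) - Y_{\cdot i} \right) \right)^T V + C_{i\cdot}. \]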

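Any local optimum of the loss corresponds to a simultaneous root of the functions $\{q(U_{i\cdot})\}_{i=1}^{m}$, $\{q(V_{i\cdot})\}_{i=1}^{n}$, and $\{q(Z_{i\cdot})\}_{i=1}^{r}$. Using a Newton step, the update for $U_{i\cdot}$ is

\[ U_{i\cdot}^{new} = U_{i\cdot} - \eta \, [q(U_{i\cdot})] [q'(U_{i\cdot})]^{-1}, \tag{A.9} \]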

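where $\eta \in (0, 1]$ is the step length, chosen using line search with the Armijo criterion [90, ch. 3]. The Newton steps for $V_{i\cdot}$ and $Z_{i\cdot}$ are analogous. To describe the Hessians of the loss, $q'$, we introduce the following notation for the Hessians of $G^*$, $H^*$ and $I^*$:

\[ G_i = \operatorname{diag}(\nabla^2 G^*(U_{i\cdot})) = \operatorname{diag}(a^{-1} \mathbf{1}), \]

\[ H_i = \operatorname{diag}(\nabla^2 H^*(V_{i\cdot})) = \operatorname{diag}(b^{-1} \mathbf{1}), \]

\[ I_i = \operatorname{diag}(\nabla^2 I^*(Z_{i\cdot})) = \operatorname{diag}(c^{-1} \mathbf{1}). \]

For conciseness we also introduce the following terms:

\[ D_{1,i} = \operatorname{diag}(W_{i\cdot} \odot f_1'(U_{i\cdot} V^T)), \qquad D_{2,i} = \operatorname{diag}(W_{\cdot i} \odot f_1'(U V_{i\cdot}^T)), \]

\[ D_{3,i} = \operatorname{diag}(\tilde{W}_{i\cdot} \odot f_2'(V_{i\cdot} Z^T)), \qquad D_{4,i} = \operatorname{diag}(\tilde{W}_{\cdot i} \odot f_2'(V Z_{i\cdot}^T)). \]

This allows us to describe the Hessians of Equation A.1 with respect to each row of the parameter matrices.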
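Lemma 4. The Hessians of Equation A.1 with respect to $U_{i\cdot}$, $V_{i\cdot}$, and $Z_{i\cdot}$ are

\[ q'(U_{i\cdot}) = \frac{\partial q(U_{i\cdot})}{\partial U_{i\cdot}} = \alpha V^T D_{1,i} V + G_i, \]

\[ q'(Z_{i\cdot}) = \frac{\partial q(Z_{i\cdot})}{\partial Z_{i\cdot}} = (1 - \alpha) V^T D_{4,i} V + I_i, \]

\[ q'(V_{i\cdot}) = \frac{\partial q(V_{i\cdot})}{\partial V_{i\cdot}} = \alpha U^T D_{2,i} U + (1 - \alpha) Z^T D_{3,i} Z + H_i. \]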

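Proof. We prove the result for $q'(U_{i\cdot})$, noting that the other derivations are similar. Since $q(\cdot)$ and its argument $U_{i\cdot}$ are both vectors, $Dq(U_{i\cdot})$ is identical to $\partial q(U_{i\cdot}) / \partial U_{i\cdot}$. Ignoring the terms that do not vary with $U_{i\cdot}$,

\[
\begin{aligned}
Dq(U_{i\cdot}) &= D \left[ \alpha \left( W_{i\cdot} \odot f_1(U_{i\cdot} V^T) \right) V + A_{i\cdot} \right] \\
&= \alpha D \left[ \left( W_{i\cdot} \odot f_1(U_{i\cdot} V^T) \right) V \right] + D A_{i\cdot} \\
&= \alpha \left\{ (V^T \otimes 1) \times \frac{\partial \operatorname{vec} \left( W_{i\cdot} \odot f_1(U_{i\cdot} V^T) \right)}{\partial (\operatorname{vec} U_{i\cdot})^T} + \left( I_k \otimes \left( W_{i\cdot} \odot f_1(U_{i\cdot} V^T) \right) \right) \times \frac{\partial \operatorname{vec} V}{\partial (\operatorname{vec} U_{i\cdot})^T} \right\} + \frac{\partial \operatorname{vec} \nabla G^*(U_{i\cdot})}{\partial (\operatorname{vec} U_{i\cdot})^T} \\
&= \alpha \left\{ (V^T \otimes 1) \times D_{1,i} V + \left( I_k \otimes \left( W_{i\cdot} \odot f_1(U_{i\cdot} V^T) \right) \right) \times 0 \right\} + G_i \\
&= \alpha V^T D_{1,i} V + G_i.
\end{aligned}
\]

The introduction of $D_{1,i} V$ above follows from observing that

\[ \left( \frac{\partial \operatorname{vec} \left( W_{i\cdot} \odot f_1(U_{i\cdot} V^T) \right)}{\partial \operatorname{vec} U_{i\cdot}} \right)_{p,q} = \frac{\partial \left( W_{ip} f_1(U_{i\cdot} V_{p\cdot}^T) \right)}{\partial U_{iq}} = W_{ip} f_1'(U_{i\cdot} V_{p\cdot}^T) V_{pq}, \]

which is simply the scaling of the $p$th row of $V$ by $W_{ip} f_1'(U_{i\cdot} V_{p\cdot}^T)$, the $p$th diagonal entry of $D_{1,i}$. □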
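The row-wise Newton step of Equation A.9, using the Hessian of Lemma 4, might look as follows in NumPy. This is a sketch under stated assumptions: $A_{i\cdot}$ is taken to be $U_{i\cdot}/a$ (the $\ell_2$ case of Lemma 3), the step length `eta` is fixed rather than chosen by an Armijo line search, and the function names are hypothetical. With the identity link the gradient $q$ is linear in $U_{i\cdot}$, so a single full step ($\eta = 1$) should land on a root of $q$, which gives a convenient check.

```python
import numpy as np

def q_row(u, x, w, V, alpha, a, f):
    """Gradient row q(U_i) with A_i = u / a (assumed l2 regularizer)."""
    return alpha * (w * (f(u @ V.T) - x)) @ V + u / a

def newton_step_row(u, x, w, V, alpha, a, f, fprime, eta=1.0):
    """One Newton step (Equation A.9) for a single row u = U_i."""
    theta = u @ V.T                                   # length-n predictions before the link
    d1 = w * fprime(theta)                            # diagonal of D_{1,i}
    H = alpha * (V.T * d1) @ V + np.eye(V.shape[1]) / a   # Lemma 4: k x k Hessian
    q = q_row(u, x, w, V, alpha, a, f)
    # H is symmetric, so the row-vector product q H^{-1} is a linear solve.
    return u - eta * np.linalg.solve(H, q)

rng = np.random.default_rng(3)
n, k = 8, 3
V = rng.standard_normal((n, k))
x = rng.standard_normal(n)
w = (rng.uniform(size=n) < 0.7).astype(float)   # zero weight marks an unobserved entry
u0 = rng.standard_normal(k)

identity = lambda t: t
ones = lambda t: np.ones_like(t)

u1 = newton_step_row(u0, x, w, V, alpha=0.5, a=10.0, f=identity, fprime=ones)
residual = np.max(np.abs(q_row(u1, x, w, V, 0.5, 10.0, identity)))
```

Each row only requires forming and solving a $k \times k$ system, which is the source of the factor-of-$k$ overhead relative to a gradient step. For a nonlinear link such as the logistic function, several damped steps per row would generally be needed.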

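An alternate form for the updates, known as the adjusted dependent variates form, can be derived by plugging the gradient $q(U_{i\cdot})$ and the Hessian $q'(U_{i\cdot})$ into Equation A.9:

\[
\begin{aligned}
U_{i\cdot}^{new} q'(U_{i\cdot}) &= U_{i\cdot} (\alpha V^T D_{1,i} V + G_i) + \alpha \eta \left( W_{i\cdot} \odot \left( X_{i\cdot} - f_1(U_{i\cdot} V^T) \right) \right) V - \eta A_{i\cdot} \\
&= \alpha U_{i\cdot} V^T D_{1,i} V + \alpha \eta \left( W_{i\cdot} \odot \left( X_{i\cdot} - f_1(U_{i\cdot} V^T) \right) \right) D_{1,i}^{-1} D_{1,i} V + U_{i\cdot} G_i - \eta A_{i\cdot} \\
&= \alpha \left( U_{i\cdot} V^T + \eta \left( W_{i\cdot} \odot \left( X_{i\cdot} - f_1(U_{i\cdot} V^T) \right) \right) D_{1,i}^{-1} \right) D_{1,i} V + U_{i\cdot} G_i - \eta A_{i\cdot}.
\end{aligned}
\tag{A.10}
\]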
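Likewise for $Z_{i\cdot}$,

\[ Z_{i\cdot}^{new} q'(Z_{i\cdot}) = (1 - \alpha) \left( Z_{i\cdot} V^T + \eta \left( \tilde{W}_{\cdot i} \odot \left( Y_{\cdot i} - f_2(V Z_{i\cdot}^T) \right) \right)^T D_{4,i}^{-1} \right) D_{4,i} V + Z_{i\cdot} I_i - \eta C_{i\cdot}. \tag{A.11} \]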

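The derivation of the update for $V_{i\cdot}$ is similar, since $L(U, V, Z \mid W, \tilde{W})$ is a linear combination and the differential operator is linear:

\[
\begin{aligned}
V_{i\cdot}^{new} q'(V_{i\cdot}) = {} & \alpha \left\{ \left( V_{i\cdot} U^T + \eta \left( W_{\cdot i} \odot \left( X_{\cdot i} - f_1(U V_{i\cdot}^T) \right) \right)^T D_{2,i}^{-1} \right) D_{2,i} U \right\} \\
& + (1 - \alpha) \left\{ \left( V_{i\cdot} Z^T + \eta \left( \tilde{W}_{i\cdot} \odot \left( Y_{i\cdot} - f_2(V_{i\cdot} Z^T) \right) \right) D_{3,i}^{-1} \right) D_{3,i} Z \right\} \\
& + V_{i\cdot} H_i - \eta B_{i\cdot}.
\end{aligned}
\tag{A.12}
\]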

While the diagonal matrices $D_{1,i}, \ldots, D_{4,i}$ may not be invertible, i.e., a diagonal entry is zero when the corresponding weight is zero, the form of the update equations shows that this does not matter. If a diagonal entry in $D$ is zero, replacing its corresponding entry in $D^{-1}$ by any nonzero value does not change the result of Equations A.10, A.11, and A.12, as the zero weight cancels it out.
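The claim about zero weights can be illustrated numerically. The sketch below evaluates the right-hand side of Equation A.10 in its adjusted dependent variates form with two different placeholder values in $D^{-1}$, and also compares it with the direct form $U_{i\cdot} q'(U_{i\cdot}) - \eta \, q(U_{i\cdot})$. The logistic link, the choice $A_{i\cdot} = U_{i\cdot}/a$, and the names `adv_rhs`, `direct_rhs`, and `fill` are assumptions made for this example.

```python
import numpy as np

def _pieces(u, w, V):
    """Shared quantities: predictions, logistic means, diagonal of D_{1,i}."""
    theta = u @ V.T
    mu = 1.0 / (1.0 + np.exp(-theta))
    return theta, mu, w * mu * (1.0 - mu)

def adv_rhs(u, x, w, V, alpha, a, eta, fill):
    """Last line of A.10; `fill` replaces 1/d wherever d is zero."""
    theta, mu, d = _pieces(u, w, V)
    d_inv = np.full_like(d, fill)
    np.divide(1.0, d, out=d_inv, where=d != 0)    # invert only nonzero entries
    z = theta + eta * (w * (x - mu)) * d_inv      # adjusted dependent variates
    return alpha * (z * d) @ V + u / a - eta * u / a

def direct_rhs(u, x, w, V, alpha, a, eta):
    """U_i q'(U_i) - eta q(U_i), computed without any inverse of D."""
    theta, mu, d = _pieces(u, w, V)
    H = alpha * (V.T * d) @ V + np.eye(V.shape[1]) / a
    q = alpha * (w * (mu - x)) @ V + u / a
    return u @ H - eta * q

rng = np.random.default_rng(4)
n, k = 8, 3
V = rng.standard_normal((n, k))
x = (rng.uniform(size=n) < 0.5).astype(float)
w = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0])   # two unobserved entries
u = rng.standard_normal(k)

rhs_a = adv_rhs(u, x, w, V, alpha=0.5, a=10.0, eta=0.5, fill=1.0)
rhs_b = adv_rhs(u, x, w, V, alpha=0.5, a=10.0, eta=0.5, fill=1e6)
gap_fill = np.max(np.abs(rhs_a - rhs_b))
gap_direct = np.max(np.abs(rhs_a - direct_rhs(u, x, w, V, 0.5, 10.0, 0.5)))
```

Wherever the weight is zero, both the residual term and the corresponding diagonal entry of $D$ vanish, so the placeholder never reaches the result.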


A.2  Generalized  and  Standard  Bregman  Divergence 

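We prove that the Generalized Bregman divergence between $\theta, x \in \mathbb{R}$,

\[ \mathbb{D}_F(\theta \,\|\, x) = F(\theta) + F^*(x) - \theta x, \]

is equivalent to the standard definition [15, 20],

\[ D_{F^*}(x \,\|\, f(\theta)) = F^*(x) - F^*(f(\theta)) - \nabla F^*(f(\theta)) (x - f(\theta)), \]

when $F$ is convex, twice-differentiable, closed and $f = \nabla F$ is invertible. By the definition of the convex conjugate,

\[ \mathbb{D}_F(\theta \,\|\, x) = F(\theta) + \sup_{\phi} \{ \phi x - F(\phi) \} - \theta x, \]

where at the supremum $\phi = f^{-1}(x)$. Therefore,

\[
\begin{aligned}
\mathbb{D}_F(\theta \,\|\, x) &= F(\theta) + f^{-1}(x) \cdot x - F(f^{-1}(x)) - \theta x \\
&= F(\theta) - F(f^{-1}(x)) - x (\theta - f^{-1}(x)) \\
&= F(\theta) - F(f^{-1}(x)) - \nabla F(f^{-1}(x)) (\theta - f^{-1}(x)) \\
&= D_F(\theta \,\|\, f^{-1}(x)).
\end{aligned}
\]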
For the above choice of $F$, $D_F(\theta \,\|\, f^{-1}(x)) = D_{F^*}(x \,\|\, f(\theta))$. When $F$ is the log-partition function of a regular exponential family, it satisfies the properties required for the equivalence between generalized and standard Bregman divergences.
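The equivalence can be confirmed numerically for a concrete exponential family. The sketch below assumes the Bernoulli case, where $F(\theta) = \log(1 + e^\theta)$, $f$ is the logistic function, $F^*$ is the negative binary entropy on $(0, 1)$, and $\nabla F^*$ is the logit; the function names and test points are illustrative.

```python
import math

def F(theta):
    """Log-partition function of the Bernoulli family."""
    return math.log1p(math.exp(theta))

def f(theta):
    """Link f = F', the logistic function."""
    return 1.0 / (1.0 + math.exp(-theta))

def F_star(x):
    """Convex conjugate of F for x in (0, 1): negative binary entropy."""
    return x * math.log(x) + (1.0 - x) * math.log(1.0 - x)

def grad_F_star(x):
    """Derivative of F_star: the logit function, the inverse of f."""
    return math.log(x / (1.0 - x))

def generalized_bregman(theta, x):
    """F(theta) + F*(x) - theta * x."""
    return F(theta) + F_star(x) - theta * x

def standard_bregman(x, mu):
    """D_{F*}(x || mu) in the standard form."""
    return F_star(x) - F_star(mu) - grad_F_star(mu) * (x - mu)

pairs = [(-2.0, 0.1), (0.0, 0.5), (1.5, 0.3), (3.0, 0.9)]
gaps = [abs(generalized_bregman(t, x) - standard_bregman(x, f(t))) for t, x in pairs]
```

For this family the divergence coincides with the Bernoulli log-loss up to a term that depends only on $x$, and it vanishes exactly when $x = f(\theta)$, as at the pair $(0, 0.5)$.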